Preface



This workshop is the tenth in a series of international workshops. This year
it takes place in Göttingen, Germany, and is organized by the University of
Hannover.

Göttingen has one of the most famous German universities, where very well
known scientists like Lichtenberg, Hilbert, Gauss and von Neumann studied,
worked and taught. It also hosts several research institutes of the
Max-Planck-Society. The first electronic tube calculator G1 was built in
Göttingen in 1952 by H. Billing. Additionally, Göttingen was selected because
it is adjacent to the world exposition EXPO 2000 in Hannover, which gives an
outlook into the 21st century covering the major topics of humankind, nature
and technology.

Inspired by these surroundings, the technical program of PATMOS 2000 includes
10 sessions dedicated to the most important subjects of power and timing
modeling, optimization and simulation at the dawn of the 21st century.

The four invited talks address the European research activities in the work- 
shop fields, the evolving needs for minimal power consumption in the area of 
wireless and chipcard applications and design methodologies of very highly in- 
tegrated multimedia processors. 

The workshop is a result of the joint work of a large number of individuals, 
who cannot all be mentioned here. In particular, we would like to acknowledge 
the outstanding work of the reviewers, who did a competent job in a timely 
manner. We also have to thank the members of the local organizing committee 
for their effort in enabling the conference to run smoothly. Finally, we gratefully 
acknowledge the support of all organizations and institutions sponsoring the 
conference. 

Constraints, Hurdles and Opportunities
for a Successful European Take-Up Action 



Rene van Leuken, Reinder Nouta, and Alexander de Graaf 

DIMES ESD-LPD, Delft University of Technology 
Mekelweg 4, H16 CAS, 2628 CD Delft, The Netherlands 
esdlpd@dimes.tudelft.nl, http://www.esdlpd.dimes.tudelft.nl



Abstract. “...Knowledge management is now becoming the foundation
of new business theory and corporate growth for the next millennium.
The key difference is that it’s about networking people not simply
processes and PCs...” [1].



1 Introduction 

Low power design became crucial with the wide spread of portable information 
and communication terminals, where a small battery has to last for a long period. 
High performance electronics, in addition, suffers from a permanent increase 
of the dissipated power per square millimetre of silicon, due to the increasing 
clock-rates, which causes cooling and reliability problems or otherwise limits the 
performance. 

The European Union’s Information Technologies Programme ‘Esprit’ therefore
launched a ‘Pilot action for Low Power Design’, which eventually grew
to 19 R&D projects and one coordination project, with an overall budget of 14
million EURO. It is now known as the European Low Power Initiative for
Electronic System Design (ESD-LPD) and will be completed by the end of 2001. It
involves 30 major European companies and 20 well-known institutes. The R&D
projects aim to develop or demonstrate new design methods for power reduction,
while the coordination project ensures that the methods, experiences
and results are properly documented and publicised.

2 European Low Power Initiative 
for Electronic System Design 

The initiative addresses low power design at various levels. This includes system 
and algorithmic level, instruction set processor level, custom processor level, 
RT-level, gate level, circuit level and layout level. It covers data dominated and 
control dominated as well as asynchronous architectures. 10 projects deal mainly 
with digital, 7 with analogue and mixed-signal, and 2 with software related 
aspects. The principal application areas are communication, medical equipment 
and e-commerce devices. 

Instead of running a number of Esprit projects at the same time independently
of each other, during this pilot action the projects have collaborated
strongly. This is achieved mostly by the novelty of this action, which
is the presence and role of the coordinator: DIMES - the Delft Institute of
Microelectronics and Submicron-technology, located in Delft, the Netherlands
(http://www.dimes.tudelft.nl). The task of the coordinator is to co-ordinate,
facilitate, and organize:

— The information exchange between projects. 

— The systematic documentation of methods and experiences. 

— The publication and the wider dissemination to the public. 

3 Constraints, Hurdles and Opportunities

The initiative has now been running for about 3 years. Roughly, we can
distinguish the following phases:

1. Selection and negotiation phase. Start: 1997. Duration: 6 months. 

2. Legal activities, contracts etc. Start: 1997. Duration: 18 months. 

3. Start of the initiative and design projects. Start: 1998. Duration: 12 months. 

4. Tracking of design project results. Start: 1999. Duration: 18 months and 
continuing. 

5. Start dissemination activities. Start: 1999. Duration: 18 months and
continuing.

6. Financial administration. Start: 1999. Duration: 18 months and continuing.

Here are some statistics:

1. Number of Associated Contracts: about 60. 

2. Number of issued task contracts: about 30. 

3. Number of contract amendments: 5 (more planned). 

4. Number of contract type changes: 2.

5. Number of appendices to each progress report: we stopped counting, we
ship them in a box.

6. Number of projects on time with deliverables: none.

7. Number of available public deliverables on our web site: about 50.

8. Number of planned low power design books: 6. The first has been published.

During the session we will present a number of theses (5 to 7) to the
audience. Each thesis will address a topic, for example: “All public
deliverables should be written using a defined design document standard”, or
“There is no knowledge dissemination problem; only the lack of people is a
problem”. We will also present some historic events and feedback from partners
and reviewers. Thereafter we will discuss each thesis with people from the
audience and see if we can arrive at a statement which expresses the opinion
of the audience.

References 

1. C. S. of Management. The Cranfield and Information Strategy Knowledge Survey. 
November 1998. 




Architectural Design Space Exploration Achieved 
through Innovative RTL Power Estimation Techniques 

Manuela Anton, Mauro Chinosi, Daniele Sirtori, and Roberto Zafalon 

STMicroelectronics, I-20041 Agrate B. (MI), Italy



Abstract. Today’s design community needs tools that address early power
estimation, making it possible to find the optimal design trade-offs without
respinning to explore the whole chip.

Several approaches based on a fast (coarse) logic synthesis step, in order to
analyze power on the mapped gate-level netlist and then create suitable power
models, have been published in recent years.

In this paper we present some applications of RTPow, a proprietary tool
for RT-level power estimation. Its innovative estimation engine does not
perform any on-the-fly logic synthesis but analyzes the HDL description from
the functional point of view, which permits a drastic time saving. Besides
this top-down estimation, RTPow supports power macromodels and a bottom-up
approach that enable effective power budgeting. Three industrial designs are
considered. The first is an Adaptive Gaussian Noise Filter (28K Eq.Gate)
described in VHDL, the second is a Motion Estimation and Compensation
Device for Video Field Rate Doubling Application (171K Eq.Gate), also
described in VHDL, and the third is a microprocessor core (111K Eq.Gate)
described in Verilog.



1 Introduction 

The increasing use of portable computing and communication systems makes power 
dissipation a critical parameter to be minimized during circuit and system design. 

Low power design needs efficient and accurate estimation tools at all design 
abstraction levels. In particular, RT-level power estimation is critical in obtaining short 
design times and is very important to help the designer in making the right architec- 
tural choices. 

Nowadays design turnaround time is a crucial requirement. By allowing
architectural exploration and “what-if” analysis of complex design trade-offs
before logic synthesis, RT-level estimation can lead to a faster
time-to-market. Accurate RT-level power estimation reduces the number of
design iterations and their cost, making power budgeting easier.

The approaches proposed in the literature (see [1] for a comprehensive survey)
can be categorized into two main classes: top-down and bottom-up methods. While
the former class is particularly suited for components with a fixed structure
and/or design (e.g., memories, data-path units), the bottom-up methods are
based on the idea of building an abstract power model by refining an initial
model template through experimental power measurements. In the bottom-up
approach, the estimated power is given by a relation between the significant
characteristics of the input stimuli and their weight coefficients. The
coefficients are determined by power characterization (e.g., by means of
linear regression or a look-up table) performed at a lower level of
abstraction, and are used to match the behavior of the analyzed block.
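
As a minimal sketch of the bottom-up idea (not RTPow’s actual implementation),
the following Python fragment fits the weight coefficients of a one-variable
linear power macromodel by least squares. The choice of the average input
toggle rate as the only stimulus characteristic, the function names and the
characterization data are all illustrative assumptions.

```python
def fit_linear_macromodel(activity, power):
    """Least-squares fit of power = intercept + slope * activity."""
    n = len(activity)
    if n < 2 or n != len(power):
        raise ValueError("need at least two matching samples")
    mean_x = sum(activity) / n
    mean_y = sum(power) / n
    sxx = sum((x - mean_x) ** 2 for x in activity)
    if sxx == 0:
        raise ValueError("activity values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(activity, power))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimate_power(model, activity):
    """Evaluate the fitted macromodel for a given input toggle rate."""
    intercept, slope = model
    return intercept + slope * activity


# Hypothetical characterization runs at a lower abstraction level:
# average input toggle rate -> measured power in mW (synthetic values).
toggle_rates = [0.1, 0.2, 0.3, 0.4, 0.5]
power_mw = [1.4, 1.8, 2.2, 2.6, 3.0]

model = fit_linear_macromodel(toggle_rates, power_mw)
```

Once fitted, `estimate_power(model, rate)` can be evaluated at RT level without
returning to the lower-level simulator; a real macromodel would typically use
several stimulus characteristics rather than one.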

The power macromodeling techniques [2], [3], [4], [5], [6] can be differentiated by 
considering the kind of power data they can actually provide: some methods allow a 
cycle-accurate estimation, while others can manage just the total average power. 

RTPow is a proprietary dynamic power estimation tool that operates at RT level
and is embedded into the Synopsys Design Environment. It can apply top-down
and bottom-up analysis and modeling techniques in any combination, to
analyze generic sparse logic on one side and pre-characterized macros and IPs
on the other. In addition, it can manage different macromodeling strategies
(i.e., table or regression based) and take advantage of any available
macromodel feature (i.e., cycle-accurate or cumulative power figures).

The objective of this paper is to validate RTPow capabilities on several
industrial applications and to benchmark the results against the corresponding
power values obtained at a lower level of abstraction, by means of a gate-level
reference power simulator (e.g., Synopsys DesignPower [12]). We have also
monitored the actual computer resource requirements, such as the total CPU
time and main memory allocated during the data processing (RT-level and
gate-level).



2 RTPow Functionality 

RTPow is an RT-level power estimation tool that works within the Synopsys’ Design- 
Compiler environment [7], as shown in Figure 1. 




Fig. 1. Conceptual Flow of RTPow 

The RTPow software architecture consists of a set of scripts, driven by certain
user-specified variables, that output information about the current design
functionality and structure to a set of files parsed and elaborated by an
underlying C++ main engine. The design functionality and structure are
inherited by RTPow from the design database; therefore, the design can be
written in any HDL supported by the Synopsys HDL-Compiler (e.g., VHDL, Verilog)
and needs to be first decomposed into Synopsys generic objects by running the
traditional analyze and elaborate steps [11]. It should be emphasized that
RTPow does not need to go through the quite expensive logic synthesis flow:
this feature enables true design exploration, giving the designer a fast
estimate and making it easier to find the power-optimal architecture.

The power estimator may work in two ways. 

1. The first one is the simulation mode. In this mode it operates as a
co-simulator by using an embedded cycle-accurate internal simulation engine.
The user needs to specify the test pattern file, which can be obtained either
from a previously written RTL testbench or by providing it on the fly. Provided
that all macromodels are cycle-accurate, RTPow can compute a cycle-based energy
report, as well as the energy peak and the simulation cycle in which that peak
occurred (see Figure 4 as an example). The cycle-by-cycle plot of energy
consumption is useful to identify the operating modes of sequential machines,
or power peaks related to some specific activity burst that might be worth
optimizing. In addition, a detailed power log structured on design blocks,
down through the hierarchy, is also available. The reported dynamic power is
split into net and internal power, as those concepts have been widely adopted
by the industrial design community. Once the simulation is over, a file is
written which contains switching activity and static probability information
about synthesis-invariant components (i.e., I/O ports, sub-module boundaries
and sequential cell outputs). This file has a “.saif” extension (Switching
Activity Interchange Format) and can be used either to annotate the switching
activity onto the RTL design, providing an appropriate forward annotation to
drive power-driven synthesis with PowerCompiler, or to run RTPow in static
mode (see below).

2. The second mode of operation is static (or probabilistic). In order to
achieve a higher accuracy, this mode first requires an annotation of the RTL
design nodes (i.e., of the synthesis-invariant ports), especially of the
sequential cell outputs, since the switching activity propagation engine tends
to lose accuracy rapidly when dealing with sequential cells (this is primarily
due to the adopted BDD representation of the design functionality, which has
an intrinsic limitation in both manageable design size and toggle rate
propagation). Node annotation may be performed either by reading an RTL
“.saif” file (e.g., from a previous run of RTPow in internal simulation mode
or from an external RTL simulation [12]) or by providing a list of
set_switching_activity commands on the appropriate ports mentioned above.
However, should node annotation be unfeasible for the design under analysis,
RTPow is still able to operate in a pure probabilistic mode, without any
external information. In such a case, the significant switching
characteristics at the primary inputs are set to predefined default values,
and these values are propagated through the circuit, down through the
hierarchy, with the aid of the switching activity propagation engine
integrated in RTPow.

Eventually, RTPow will provide the total average power figure for each
hierarchical block. Of course, since no input test patterns are involved in
the static estimation mode, the cycle-based power log is not available.
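
The two quantities recorded for each synthesis-invariant node, switching
activity and static probability, can be illustrated with a short Python
sketch. It derives both from a per-cycle trace of one net; the function name
and the trace are hypothetical, and the sketch does not reproduce the SAIF
file format itself.

```python
def activity_stats(samples):
    """Return (static_probability, toggle_count, toggle_rate) for one net.

    samples: sequence of 0/1 logic values, one per simulation cycle.
    """
    if not samples:
        raise ValueError("need at least one sample")
    # Static probability: fraction of cycles in which the net is at logic 1.
    static_probability = sum(samples) / len(samples)
    # Toggle count: number of value changes between consecutive cycles.
    toggle_count = sum(1 for a, b in zip(samples, samples[1:]) if a != b)
    transitions = len(samples) - 1
    toggle_rate = toggle_count / transitions if transitions else 0.0
    return static_probability, toggle_count, toggle_rate


# Hypothetical trace of a sequential cell output over eight cycles.
trace = [0, 1, 1, 0, 1, 0, 0, 1]
static_probability, toggle_count, toggle_rate = activity_stats(trace)
```

In simulation mode such figures come from counting, as here; in static mode
they are either annotated from a previous run or replaced by default values at
the primary inputs.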



3 More on RTPow Functionality 

RTPow is based on the exploration of circuit functionality and design structure
complexity. Some earlier works ([8], [9], [10]) have proposed adopting an
initial design description from which the circuit topology is directly
imported into the estimator by extracting an equivalent functional
representation (typically by means of BDDs - Binary Decision Diagrams).

Given the context of RTPow, we wanted an effective method to get the design
database directly from the EDA environment used to analyze the design. After
source code analysis and elaboration, an RT-level description is expressed
within DesignCompiler as an interconnection of different types of primitives
(e.g., Gtech ports, generic logic blocks, synthetic operators, DesignWare
modules, generic sequential operators) [11]. The difficulty here is to make
this type of description available to the underlying C++ analysis engine. The
task is carried out in RTPow by dumping the circuit as a set of equations.
Then, connections between the previous components and sequential cells are
recognized by parsing the file produced by the report_cell command and by
including their functionality into the previously created representation.

The top-down approach in RTPow investigates the circuit topology and extracts,
from the Synopsys representation before mapping (technology-independent
components from Synopsys’ generic library interconnected with clusters of
combinational logic, in turn expressed by the same generic library
components), information readable by the underlying estimation engine. The
generic combinational functionality is represented as a BDD structure. This is
equivalent to a 2-to-1 MUX mapping (see Figure 2), with input signals
connected to the selection pin of each MUX. Since this is one possible library
mapping, the number of BDD nodes can serve as a measure of area occupancy: the
area allocated by such a BDD mapping is fitted, by means of linear regression
on the number of BDD nodes, to the actual area obtained on a large number of
benchmark circuits implemented in the desired target technology.
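
The equivalence between a BDD and a network of 2-to-1 MUXes can be sketched
in a few lines of Python. The example below is only an illustration under
stated assumptions: the tuple encoding of BDD nodes, the example function
f = (a AND b) OR c, and the regression coefficients in `estimate_area` are
hypothetical and are not taken from RTPow.

```python
def mux(select, when0, when1):
    """2-to-1 multiplexer: the BDD variable drives the selection pin."""
    return when1 if select else when0


def evaluate(node, assignment):
    """Evaluate a BDD by treating every node as a 2-to-1 MUX."""
    if node in (0, 1):  # terminal nodes
        return node
    var, low, high = node
    return mux(assignment[var],
               evaluate(low, assignment),
               evaluate(high, assignment))


def count_nodes(node, seen=None):
    """Count distinct non-terminal nodes (shared sub-graphs count once)."""
    seen = set() if seen is None else seen
    if node in (0, 1) or node in seen:
        return 0
    seen.add(node)
    _, low, high = node
    return 1 + count_nodes(low, seen) + count_nodes(high, seen)


def estimate_area(node_count, area_per_node=3.5, offset=10.0):
    """Linear area fit; both coefficients are made-up placeholders."""
    return offset + area_per_node * node_count


# f = (a AND b) OR c with variable order a, b, c; node_c is shared.
node_c = ("c", 0, 1)
node_b = ("b", node_c, 1)
f = ("a", node_c, node_b)
```

Here `count_nodes(f)` plays the role of the MUX count that is fitted to the
measured area; in practice the two coefficients would come from the benchmark
characterization described above.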

Area estimation is then used in power modeling as an approximation of the
module’s capacitance. Therefore, to get the power values, both area and
average switching activity estimates are needed. Switching activity is
estimated on each virtual net of the equivalent MUX mapping, either in
simulation mode or in probabilistic mode, but the estimation is done
differently: while simulation mode simply counts the number of transitions
produced by a set of input patterns, probabilistic mode uses statistical
relations to get the toggle rate on each virtual net of the equivalent MUX
mapping.
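
The paper does not spell out the statistical relations used in probabilistic
mode. As a hedged illustration only, the sketch below uses textbook
simplifications: signals are assumed spatially independent and temporally
uncorrelated, and the average switching power follows the usual
0.5 * C * Vdd^2 * f * (toggle rate) expression. All function names and
numerical values are assumptions.

```python
def mux_output_probability(p_sel, p_in0, p_in1):
    """Signal probability at a 2-to-1 MUX output, inputs assumed independent."""
    return (1.0 - p_sel) * p_in0 + p_sel * p_in1


def toggle_rate_from_probability(p):
    """Expected transitions per cycle of a temporally uncorrelated signal."""
    return 2.0 * p * (1.0 - p)


def dynamic_power(capacitance_f, vdd_v, clock_hz, toggle_rate):
    """Average switching power in watts: 0.5 * C * Vdd^2 * f * toggle_rate."""
    return 0.5 * capacitance_f * vdd_v ** 2 * clock_hz * toggle_rate


# Hypothetical virtual net: select probability 0.5, data inputs 0.2 and 0.8.
p_out = mux_output_probability(0.5, 0.2, 0.8)
rate = toggle_rate_from_probability(p_out)
# Hypothetical 1 pF net capacitance (area-derived), 2.5 V supply, 100 MHz.
power_w = dynamic_power(1e-12, 2.5, 100e6, rate)
```

In simulation mode the `rate` argument would instead be a counted toggle rate,
while the capacitance term is still the area-derived approximation.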

The result is tuned to the target technology and to the actual synthesis policy
the designer is adopting (e.g., timing-driven, area-driven or power-driven) by
means of area and capacitance scaling factors obtained, once and for all, by
characterizing the correlation between the MUX-based design and the actual
technology mapping on a large number of pre-defined industrial designs.

Power consumed both on nets and inside combinational cells is processed and 
reported separately. 

A more accurate estimation can be obtained by using dynamically linked power
macromodels. In the design flow, reuse of intellectual property blocks is a key
factor in meeting the required time-to-market. Macromodeling implies a single
power characterization step, performed once and for all, and provides
power-related information to perform a fast and accurate estimation. These IPs
can be either soft or hard macros seen as black boxes. This category includes
all DesignWare modules as well as every IP block for which a macromodel is
available.

Each time a block in the design hierarchy is considered, RTPow first attempts
the bottom-up approach: it checks whether a power macromodel exists in the
macro library, and only if it does not find one does it use the top-down
approach.
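
This selection rule can be sketched as a simple dispatcher. The block
dictionaries, the macro library contents and the two estimator functions below
are hypothetical stand-ins; RTPow’s real macromodels are dynamically linked
libraries behind an interface the paper does not detail.

```python
def estimate_block(block, macro_library, top_down):
    """Return (power, method) for one block of the design hierarchy."""
    macromodel = macro_library.get(block["type"])
    if macromodel is not None:
        # A characterized macromodel exists: use the bottom-up estimate.
        return macromodel(block), "bottom-up"
    # Otherwise fall back to the top-down (BDD/MUX-based) estimate.
    return top_down(block), "top-down"


def estimate_design(blocks, macro_library, top_down):
    """Sum the per-block estimates over a flat list of blocks."""
    return sum(estimate_block(b, macro_library, top_down)[0] for b in blocks)


# Hypothetical macro library with a single adder model (power in mW).
macro_library = {"adder": lambda block: 0.02 * block["width"]}
# Hypothetical top-down estimator driven by the BDD node count.
top_down = lambda block: 0.001 * block["bdd_nodes"]

blocks = [
    {"type": "adder", "width": 16},
    {"type": "glue_logic", "bdd_nodes": 200},
]
total_mw = estimate_design(blocks, macro_library, top_down)
```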

RTPow is independent of the algorithm used by macromodels to estimate the
required parameters, i.e., different implementations can be used for a variety
of circuits. For example, if the analyzed macro is a whole processor, an
instruction-based estimation could be the most efficient solution, while a
look-up table could be good for a block of control logic.

The macromodel details are hidden from RTPow: the macromodel is built as an
external dynamically linked library implementing a given interface.

The process of macro-block characterization and model building has been fully
automated for the table-based macromodel (see below), which has proven to be
the most accurate and robust solution. This method provides a look-up table
addressed by some concise form of the instance’s sensitization (e.g., input
switching activity) and retains the property of being automatically extracted
and general. Moreover, it is robust because a look-up table can represent any
function with a desired accuracy, provided that the table can be made
arbitrarily large.
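
A table-based macromodel of this kind might look as follows. This is a sketch
under assumptions: the table is one-dimensional, it is addressed by the average
input switching activity, values between entries are linearly interpolated, and
all table contents are invented. The paper does not state how RTPow addresses
or interpolates its tables.

```python
from bisect import bisect_right


class TableMacromodel:
    """One-dimensional look-up table: input activity -> energy per cycle."""

    def __init__(self, activity_points, energy_values):
        if len(activity_points) != len(energy_values) or len(activity_points) < 2:
            raise ValueError("need two or more matching table entries")
        if any(a >= b for a, b in zip(activity_points, activity_points[1:])):
            raise ValueError("activity points must be strictly increasing")
        self.x = list(activity_points)
        self.y = list(energy_values)

    def energy(self, activity):
        # Clamp queries that fall outside the characterized range.
        if activity <= self.x[0]:
            return self.y[0]
        if activity >= self.x[-1]:
            return self.y[-1]
        # Locate the table interval and interpolate linearly inside it.
        i = bisect_right(self.x, activity) - 1
        t = (activity - self.x[i]) / (self.x[i + 1] - self.x[i])
        return self.y[i] + t * (self.y[i + 1] - self.y[i])


# Hypothetical characterization table (energy in pJ per cycle).
model = TableMacromodel([0.0, 0.25, 0.5, 1.0], [1.0, 2.0, 4.0, 5.0])
```

Adding more entries to the table tightens the approximation, which is the
robustness argument made above, at the cost of a longer characterization run.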

The automatic table-based macromodel building requires only the mapped
implementation of the IP block on the RTPow reference library; there is no need
to re-characterize for a different technology target.

Semi-automatic approaches have been developed for other techniques (e.g.,
polynomial regression), where the dynamic library construction is not yet
automatic.



4 RT-Level Library Characterization 

RTPow is able to evaluate the overall power performance of the technology cell
library, and in this way it takes the target library specifications into
account during estimation.

During library characterization a set of industrial RTL benchmarks is mapped to
the gate level and the corresponding statistical area occupation and power
consumption values are inferred. This information is used by RTPow during
top-down estimation and to compute a number of technology scaling factors to
be passed to power macromodels.

The process of library characterization is straightforward and fully automated.
A user-friendly interface allows users to run the task at any time at their
own site.

In order to address the synthesis based on design constraints, the library can
be characterized and its high-level parameters can be computed in relation to
different design strategies such as minimum area or maximum speed synthesis,
and an average input slope can be specified to improve accuracy.



5 Test Cases and Experimental Results 

To evaluate RTPow performances and capabilities, we have chosen three industrial 
designs. 

The first one is an Adaptive Gaussian Noise Filter (GNR from now on), written in 
VHDL, the second one is a Motion Estimation and Compensation Device for Video 
Field Rate Doubling Application (50Hz_to_100Hz) also described in VHDL. The third 
design is a Core Processor (uP) described using Verilog language. 

The gate implementations of the designs listed above contain the equivalent num- 
ber of gates reported in Table 1. 



Table 1.

Design            Eq. Gate
GNR               28K
50Hz_to_100Hz     171K
uP                111K



5.1 Adaptive Gaussian Filter 

GNR is an adaptive intra-field spatial filter for Gaussian noise reduction,
based on recursive estimation of the noise level. GNR realizes a pre-processing
filter to improve the quality of input video sources in order to

• reduce the amount of Gaussian noise in the image, mainly in its High Spatial
Frequency (HSF) components;

• reduce the HSF components of the video signal in all areas where they are
not perceived by the Human Visual System.

The applied pre-processing is an adaptive low-pass filtering. It improves the
subjective quality by reducing the Gaussian noise.

The GNR receives as input three different sets of signals, each one related to
different information: synchronization, image and filtering.

5.2 Motion Estimation and Compensation Device for Video Field 
Rate Doubling Application 

GNR is part of a larger ST project, named Motion Estimation and Compensation 
Device for Video Field Rate Doubling Application (50Hz_to_100Hz). 

50Hz_to_100Hz is a new device for field rate doubling based on a
motion-compensation algorithm, where motion information is estimated before
its compensation in the final interpolation process. The motion estimation
process is based on a recursive block-matching technique. The GNR is
introduced as a pre-processing filter to improve the quality of the real
motion estimation content. The market introduction of high-end TV sets, based
on 100 Hz CRTs, required the development of reliable field rate upconversion
techniques to remove artifacts such as large-area and line flicker.

5.3 Core Processor (uP) 

The core consists of one or more basic execution units (named clusters), an
instruction fetch unit and a memory interface (the core memory controller). A
uP cluster consists of one or more arithmetic units, a register file and an
interface to the core memory controller. The design used for the experiments
contains a single cluster.

5.4 Results 

In order to validate the power estimations determined at architectural level by
RTPow, each design has been synthesized with a standard cell library realized
in 0.25 µm CMOS technology and the actual power consumption has been estimated
using DesignPower. It is important to highlight that RTPow adopts table-based
power macromodels also when dealing with the DesignWare modules of the Synopsys
generic objects representation (i.e., each DesignWare block has been previously
mapped and characterized, and a table-based energy model has been generated,
thus making the power estimation process faster).

Figure 3 (a) compares pre-synthesis RTL power estimation values resulting from 
RTPow to the post-synthesis power figures coming from DesignPower, for the first two 
circuits (4.28% difference). 

In Figure 3 (b), the processor core (uP) has been tested under two realistic
patterns of input stimuli traced from a high-level system simulation, and the
results of RTPow and DesignPower are reported. The first set of power values
addresses the case when the processor is not executing any operation (i.e., a
sequence of NOPs), while the other refers to the case when the processor is
executing two additions simultaneously (2 ADDs). For both these applications,
the average power consumption is plotted against the processor stalling
percentage (on the X axis), mainly due to cache memory misses.

Regarding the processor core, it may be noticed that, although RTPow
overestimates the absolute power values, the relative power figures predicted
at RTL over a wide range of stalling probabilities are in quite close agreement
with the corresponding power figures reported at gate level (always assumed as
the reference).
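
The distinction between absolute and relative accuracy can be made concrete
with a small sketch. The power figures below are invented purely for
illustration and are not the measurements of Figure 3; the function names are
likewise assumptions.

```python
def percent_error(estimate, reference):
    """Signed error of an RTL estimate relative to the gate-level reference."""
    return 100.0 * (estimate - reference) / reference


def same_ranking(estimates, references):
    """True if both series order the operating points identically."""
    def order(values):
        return sorted(range(len(values)), key=lambda i: values[i])
    return order(estimates) == order(references)


# Hypothetical average power (mW) at three stalling percentages.
rtl_mw = [12.0, 10.8, 9.6]
gate_mw = [10.0, 9.0, 8.0]

errors = [percent_error(e, r) for e, r in zip(rtl_mw, gate_mw)]
trend_preserved = same_ranking(rtl_mw, gate_mw)
```

In this made-up case every point is overestimated by about 20%, yet the
ordering of the operating points is preserved, which is the property that
matters when comparing architectural alternatives.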

Indeed, the processor core represents an extreme test case for architectural
estimation, since it is strongly based on a single clock domain which drives
all the sequential cells (i.e., FFs and latches) of the deeply pipelined
internal architecture. As of today, the physical implementation of this kind of
heavily loaded network is usually managed by a set of appropriate clock tree
synthesis techniques, whose major goal is the optimal placement and routing of
these high fan-out, hierarchical networks in order to meet the severe design
constraints on the maximum delay and maximum skew between the root and each
leaf cell of the interconnection tree.

While the implication, in terms of power performance, of the clock tree
physical implementation is fully tractable at gate level (provided a consistent
post-layout back annotation), the prediction of such a structure during the
architectural estimation is extremely uncertain and still lacks a general
solution.

In our specific design, while RTPow’s analysis of the clock tree is based on
the estimation of the switching energy associated with an equivalent global
network with a given fan-out (easily exceeding 10000 leaf cells), the actual
implementation of this network is a hierarchical and balanced tree of buffers,
necessary to meet the global timing constraints, including the avoidance of
any slope degradation. As a matter of fact, the overestimation of the clock’s
switching energy is due to the large slope degradation induced by assuming an
equivalent global network driving an extremely large fan-out.

In our future work we intend to address the development of a robust and more
suitable prediction model for those physical structures.

Almost all works addressing RTL power estimation focus on power model accuracy.
Power models, in turn, are strictly dependent on the evaluation conditions. In
the real world, for industrial designs, tuning all the characteristics involved
in the estimation to the actual functional conditions is recognized to be quite
hard.

Certainly the goal of an RTL estimator is not to provide sign-off power values
but rather to allow designers to explore, evaluate, compare and eventually
optimize different architectures using various components and IP blocks,
choosing the best candidate for minimal power consumption. The RTL estimations
inherently highlight the “hot” issues, i.e., the architectures that should be
modified or substituted in order to minimize the overall device power
consumption (see [8] and [7] for a survey of RTPow design exploration
capabilities).

In order to obtain a substantial increase in absolute accuracy, the estimator
features would need a better matching with real conditions. This issue cannot
be solved without a huge time investment and a subsequent severe impact on the
demanding time-to-market.

Fig. 3. Power Estimation on three industrial test cases - Absolute Power



Provided that all macromodels are cycle-accurate, RTPow can output, when
running in simulation mode, a cycle-based energy report, as well as the energy
peak and the simulation time at which that peak was registered. Figure 4
illustrates the energy behavior of the GNR when simulated for a period of
223000 ns. The reported energy peak is 1.04048 uJ, and the corresponding
simulation time is 882 ns.
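
Extracting the peak from a cycle-based energy report is straightforward, as
the following sketch shows. The energy trace and the clock period are
hypothetical placeholders and do not correspond to the GNR run described
above.

```python
def energy_peak(cycle_energy, clock_period_ns):
    """Return (peak_energy, peak_cycle, peak_time_ns) of a per-cycle trace."""
    if not cycle_energy:
        raise ValueError("empty energy trace")
    # Index of the first cycle holding the maximum energy value.
    peak_cycle = max(range(len(cycle_energy)), key=lambda i: cycle_energy[i])
    return cycle_energy[peak_cycle], peak_cycle, peak_cycle * clock_period_ns


# Hypothetical per-cycle energy values in uJ, with a 10 ns clock period.
energy_uj = [0.20, 0.35, 0.90, 0.40, 0.25]
peak, cycle, time_ns = energy_peak(energy_uj, clock_period_ns=10)
```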



Fig. 4. Cycle-based Energy for GNR (vertical axis: Energy [uJ])

As mentioned in Section 3, the top-down approach in RTPow investigates the
circuit topology and extracts, from the Synopsys representation before mapping
(technology-independent components from Synopsys’ generic library
interconnected with clusters of combinational logic, in turn expressed by the
same generic library components), information readable by the underlying
estimation engine. All this information is stored by RTPow in an internal
database that can be reused when the input stimuli are changed (and the circuit
structure is not modified), with a significant saving of time. Table 2 reports
the CPU time and memory used by RTPow (RTP) and by DesignPower (DP) during
estimation, for all three circuits. We can observe the high speed of the RTPow
estimation process, both when building its own database (column 1) and when
the database is already available (incremental mode, reported in column 3).



Table 2.

Design   RTP CPU (s)   RTP mem (kB)   RTP CPU, database ready (s)   RTP mem, database ready (kB)   DP CPU (s)   DP mem (kB)

6 Conclusions

We have presented RTPow capabilities on several industrial applications and we
have compared the results with the corresponding power values obtained at a
lower level of abstraction. The RTPow behavior on real industrial designs, as
well as the results obtained, justify the assertion that RTPow is an effective
tool for power design exploration, suitable for integration into an existing
industrial design flow, as it allows the designer to quickly evaluate the
“what-if” possibilities and to choose the best circuit architecture for a
power-conscious design in a pre-synthesis environment.

Abstract. Most power macromodels for RTL datapath modules are both
data-dependent and activity-sensitive, that is, they model power in terms of
some activity measure of the data inputs of the module. These models have
proved to be quite accurate for most combinational RTL datapath macros (such as
adders and multipliers), as well as for storage units (such as registers). They
tend to become inadequate for RTL modules that are control-dominated, that is,
modules having a set of control inputs that exercise different operational
behaviors. Furthermore, some of these behaviors may be input-insensitive, that
is, they let the module evolve (and thus consume power) in a semi-autonomous
way, independently of the input activity. We propose a procedure for the
construction of ad-hoc power models for semi-autonomous RTL macros. Our
approach is based on the analysis of the functional effect of such control
inputs on specific macros. Although the resulting models are tailored to
individual macros, the model construction procedure keeps the desirable
property of being automatic.



1 Introduction 

Most approaches to high-level power estimation specifically target RTL
estimation by building abstract power models for the various datapath modules
(for a comprehensive survey, see [1,2]). Some of these models [3,4,5,6] may be
parameterized with respect to the bit-width of the input data, so that a base
model can be scaled according to specific, macro-dependent factors, thus
avoiding the characterization of a macro for every possible value of the
bit-width.

Power macromodels are usually built for either combinational RTL modules
(such as adders or multipliers), or for storage units (such as registers or
register files) with relatively simple I/O behavior. These types of modules
share the property of being data-dominated, that is, their power is strongly
correlated with the activity profile of the input data. The corresponding power
models thus relate power to statistical properties of the data inputs. For
instance, a widely used power model includes an average measure of the
input/output switching activity and of the input probability [7,8,9,6]. The
average is computed with respect to the size of the input/output data. The
rationale behind this averaging process is that data inputs have a meaning as a
whole, and a single quantity is enough to characterize them.

* This work was supported, in part, by the EC under grant n. 27696 “PEOPLE”.

There are other classes of macros, however, for which these types of models
may result in significant estimation errors. This is the case of
control-dominated macros, i.e., macros having a set of control inputs that
bring them into totally different operational modes. In addition, some of these
modes can be input-insensitive, i.e., the corresponding behavior of the module
tends to be totally autonomous. When the macro exhibits such behavior, the
traditional “activity-sensitive” model (following the terminology of [4])
becomes inadequate. We call these types of macros semi-autonomous, to emphasize
the possible insensitivity to the input activity.

A typical example of a semi-autonomous macro is a counter with enable or
load control signals. If counting is enabled, the counter will switch in every
clock cycle, even though no switching occurs on the data inputs. While it is
true that the clock input can be used to track the switching due to counting,
it is also true that models that use average switching measures as parameters
will hide clock switching inside the average. Furthermore, most models are
black-box, so they do not exploit module-specific information such as the
semantics of the input signals. Conversely, if the load input is asserted, the
counter will switch to an input-sensitive behavior, since the loaded value will
determine the amount of switching in that clock cycle.

Although the literature on power modeling is vast, the issue of multi-mode, 
semi-autonomous macros has not been investigated thoroughly. In some applica- 
tions, however, the power impact of such RTL modules (counters but also shift 
registers) can be sizable. Designs requiring timeouts, or signal processing appli- 
cations usually exhibit several instances of such macros. Resorting to traditional 
black-box models may consequently impair the accuracy of the power estimator. 

In this work, we propose a procedure for the construction of ad-hoc power
models for semi-autonomous RTL macros. Our approach is based on the analysis
of the functional effect of the control inputs on specific macros. This does
not simply imply using a straightforward modification of a black-box model,
where the control signals (and thus their statistical properties) are
explicitly “exposed” in the model as individual parameters. This is, for
example, the approach followed in [12], where control signals are used to split
the basic model into a set of sub-models (a regression tree, in their
terminology), one for each possible assignment of the control signals.

In our case, the model is a single equation, whose form is derived from the
inspection of the functional description of the macro. The result is a model
which is generally non-linear, because higher-order terms are used so that the
joint effect of some parameters is properly taken into account.

We emphasize that the proposed models are not black-box, because they
exploit specific functional and behavioral information about the macro. The
distinction between data and control inputs is the minimum information
required. However, this information can be approximately recovered by
simulation, by “measuring” the sensitivity of the outputs to the individual
input signals. Techniques like those of [11,6,13] can be used for that purpose.
In that case, although with a lower level of confidence, the model can be used
as a black-box one.

Regardless of how the functional information is provided, the construction of 
the proposed models is automatic, and is therefore suitable to be incorporated 
into a fully automatic estimation tool. 

Experimental results on a set of RTL macros with the characteristics de- 
scribed above, taken from the Synopsys DesignWare library, demonstrate the 
increased accuracy of the proposed models with respect to both conventional 
black-box models and ad-hoc models where control signals are treated separately 
from the other inputs. 

2 Semi-autonomous Sequential Macros 

Consider an up-down counter with four modes of operation controlled by three
control signals: Ld, Cen and UpDn. When Ld is 1, the current input data DataIn
are loaded into the internal register. When Ld is 0 and Cen is 0, the counter
is idle. When Ld is 0 and Cen is 1, the content of the register is incremented
(if UpDn is 1) or decremented (if UpDn is 0) by 1, independently of the input
data. The internal state is always observable from primary outputs DataOut, and
an additional terminal-count flag (TerCnt) is raised whenever the all-1 state
is reached.

Count-up and count-down modes are autonomous operating modes for the
up-down counter, because its behavior (and its power consumption) is not
affected by the data inputs. We use a Boolean function ($F_{DataIn}$) to
represent the sensitivity of the macro to DataIn, i.e., the set of
configurations of the control bits that make its behavior sensitive to DataIn:

$F_{DataIn}(Ld, Cen, UpDn) = Ld$

The above function expresses the fact that DataIn affects the behavior of the
macro if and only if control signal Ld is set to 1.
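
The sensitivity function of the counter can be checked by brute force on a small behavioral model. The sketch below is only an illustration: the function names, the 4-bit width and the wrap-around on overflow are assumptions, not details taken from the DesignWare macro.

```python
from itertools import product

def counter_next(state, ld, cen, updn, data_in, width=4):
    """Next register value of the up-down counter (assumed to wrap around)."""
    mask = (1 << width) - 1
    if ld:                      # load mode: DataIn is copied into the register
        return data_in & mask
    if not cen:                 # idle mode: the register keeps its value
        return state
    step = 1 if updn else -1    # count-up or count-down mode
    return (state + step) & mask

def is_sensitive_to_data(ld, cen, updn, width=4):
    """True if two DataIn values can lead to different next states."""
    values = range(1 << width)
    for state in values:
        outcomes = {counter_next(state, ld, cen, updn, d, width) for d in values}
        if len(outcomes) > 1:
            return True
    return False

# Sensitivity to DataIn for every configuration of (Ld, Cen, UpDn).
sensitivity = {
    (ld, cen, updn): is_sensitive_to_data(ld, cen, updn)
    for ld, cen, updn in product((0, 1), repeat=3)
}
```

For this model, the enumerated table agrees with the sensitivity function given above: the entry is true exactly for the configurations with Ld = 1.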



Fig. 1. Schematic Structure of a Generic Semi-Autonomous Macro (a), and
Propagation of an Input Signal (b), (c).
The up-down counter example has several interesting properties:

1. It contains a register;

2. The state of the register is directly observable at primary outputs;

3. It has different operating modes controlled by a few control signals;

4. Some configurations of the control signals make it insensitive to (some of)
the input data.

We call semi-autonomous a macrocell with the four properties listed above. A
schematic representation of the structure of a generic semi-autonomous macro
is shown in Fig. 1-(a). Data inputs are denoted by X[Ni-1,0], primary outputs
by Y[No-1,0], control inputs by C[Nc-1,0], and the clock signal by clk.

There are three main structural characteristics of the macro. First, all state 
bits are also output signals, thus allowing the observability of the internal state. 
Second, control signals C may feed both the combinational logic and the registers. 
Finally, some output signals may directly derive from the combinational logic. 

Fig. 1-(b) and -(c) show the propagation of a generic signal (namely, X[j])
through the combinational logic, for two different assignments of the control
bits. The shaded region within the combinational logic represents the
sensitivity to a given input signal X[j]. Depending on the current value of the
control inputs C, input signal X[j] may or may not propagate to primary outputs
and registers.

In Fig. 1-(b), the input signal reaches the outputs of the combinational logic,
thus affecting the state bits and/or the outputs of the macro. In Fig. 1-(c),
its propagation is blocked by the control signals, without affecting the
functionality of the macro. In the latter case, the macro is autonomous w.r.t.
X[j].

If we map Fig. 1-(a) onto the up-down counter example, X represents DataIn,
Y represents both DataOut and TerCnt, while C is the array of control signals
Ld, Cen, UpDn. Fig. 1-(b) and -(c) represent the propagation of any bit of
DataIn when Ld is set to 1 and 0, respectively.

3 Power Models for Semi-autonomous Macros 

In principle, black-box activity-sensitive power models developed for general
functional units could be applied to semi-autonomous macros as well. Consider,
for example, a simple regression equation relating the power consumption of the
macro to the average switching activity at its primary inputs and outputs:

$P_{BB} = c_0 + c_1 D_{in} + c_2 D_{out}$    (1)

The c's are fitting coefficients, the D's are average transition densities, and
$P_{BB}$ is the power estimate given by the black-box model. For a macro with
$N_{in}$ inputs simulated for $N_P + 1$ patterns, the input transition density
is computed as:

$D_{in} = \frac{1}{N_{in} N_P} \sum_{j=1}^{N_{in}} \sum_{n=1}^{N_P} \left| in_j[n] - in_j[n-1] \right|$

where $in_j[\,]$ represents the generic input signal.
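
As an illustration of how a transition density can be measured from a simulated stream, the sketch below counts bit toggles between consecutive patterns and normalizes by the number of bits and of pattern pairs. The function name and the pattern representation (one tuple of 0/1 values per clock cycle) are assumptions made for this example.

```python
def transition_density(patterns):
    """Average number of toggles per bit and per pair of consecutive patterns.

    `patterns` is a list of equal-length tuples of 0/1 values, one tuple per
    clock cycle. A stream of N_P + 1 patterns contains N_P consecutive pairs.
    """
    n_pairs = len(patterns) - 1
    n_bits = len(patterns[0])
    toggles = sum(
        prev[j] != curr[j]
        for prev, curr in zip(patterns, patterns[1:])
        for j in range(n_bits)
    )
    return toggles / (n_bits * n_pairs)

# Four 2-bit patterns: 1 + 1 + 2 = 4 toggles over 2 bits and 3 pairs.
example_density = transition_density([(0, 0), (1, 0), (1, 1), (0, 0)])
```

The same function can be applied separately to the data inputs, the control inputs and the outputs to obtain the individual densities used by the models discussed in this section.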
Model (1) can be applied to a semi-autonomous macro by computing the
input activity as the weighted average of the activity densities at the data
and control inputs (denoted by $D_X$ and $D_C$, respectively):

$P_{BB} = c_0 + c_1 \frac{N_X D_X + N_C D_C}{N_X + N_C} + c_2 D_Y$    (2)

where $N_C$ and $N_X$ denote the number of control and data inputs,
respectively.

From Equation 2 we observe that the same coefficient ($c_1$) statically
multiplies the activity of data and control inputs. Hence, they are assumed to
have the same impact on power consumption, and their contributions are assumed
to be independent of the operating mode. Both assumptions are unrealistic and
may lead to estimation errors that can be reduced by taking into account the
peculiarities of semi-autonomous macros.

To overcome the first limitation of the black-box model, we observe that the
activity of data and control signals may have a different effect on the power
consumed by the macro. Hence, we split the second term of Equation (2) and
use two independent coefficients for $D_X$ and $D_C$. We denote by $P_{LIN}$
the power estimates provided by the new linear model:

$P_{LIN} = c_0 + c_1 D_X + c_2 D_C + c_3 D_Y$    (3)



Second, we observe that the power contribution of each input signal may depend
on the operating mode of the macro. In particular, we expect the activity of
input signal X[j] to have a sizeable impact on power consumption when it
affects the functionality of the macro, and little or no impact when its
propagation through the combinational logic is blocked by the control signals.

In the up-down counter example of Section 2, propagation of input data is
conditioned to Ld = 1. Hence, we could characterize two coefficients for $D_X$
(i.e., the activity density of DataIn) to be used alternatively depending on
the value of control signal Ld:

$P_{CND} = c_0 + c_{1,Ld} D_{X,Ld} P_{Ld} + c_{1,Ld'} D_{X,Ld'} P_{Ld'} + c_2 D_C + c_3 D_Y$    (4)

where subscripts Ld and Ld' denote quantities referring to operating modes with
Ld=1 and Ld=0, respectively. In particular, $D_{X,Ld}$ denotes the transition
density computed on the subset of input patterns with Ld=1, while $P_{Ld}$
denotes the signal probability of Ld. Notice that $D_{X,Ld}$ is the conditional
probability of having a transition on a data input when control signal Ld=1.

If we assume data and control inputs to be independent of each other, we may
replace conditional probabilities with total probabilities
($D_{X,Ld} = D_{X,Ld'} = D_X$), thus obtaining:

$P_{CND} = c_0 + c_{1,Ld} D_X P_{Ld} + c_{1,Ld'} D_X (1 - P_{Ld}) + c_2 D_C + c_3 D_Y$    (5)

where $(1 - P_{Ld})$ has been used in place of $P_{Ld'}$. The power estimates
provided by Equation (5) are denoted by $P_{CND}$, since they are conditioned
to the sensitivity function of the data inputs: $F_{DataIn} = Ld$.
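
The relation between the linear model of Equation (3) and the conditioned model of Equation (5) can be made concrete with a short sketch. The function names and all coefficient values below are hypothetical and chosen only for illustration; they are not characterized values.

```python
def p_lin(c, d_x, d_c, d_y):
    """Equation (3): linear model with separate data and control terms."""
    c0, c1, c2, c3 = c
    return c0 + c1 * d_x + c2 * d_c + c3 * d_y

def p_cnd(c, d_x, p_ld, d_c, d_y):
    """Equation (5): the data term is split by the signal probability of Ld."""
    c0, c1_ld, c1_not_ld, c2, c3 = c
    data_term = c1_ld * d_x * p_ld + c1_not_ld * d_x * (1.0 - p_ld)
    return c0 + data_term + c2 * d_c + c3 * d_y

# With equal data coefficients the conditioned model collapses to Equation (3).
same = p_cnd((1.0, 2.0, 2.0, 3.0, 4.0), d_x=0.5, p_ld=0.3, d_c=0.1, d_y=0.2)
lin = p_lin((1.0, 2.0, 3.0, 4.0), d_x=0.5, d_c=0.1, d_y=0.2)

# With a zero coefficient for Ld = 0 and Ld never asserted, the data activity
# no longer influences the estimate (the counter is autonomous).
quiet = p_cnd((1.0, 2.0, 0.0, 3.0, 4.0), d_x=0.1, p_ld=0.0, d_c=0.1, d_y=0.2)
busy = p_cnd((1.0, 2.0, 0.0, 3.0, 4.0), d_x=0.9, p_ld=0.0, d_c=0.1, d_y=0.2)
```

The second case reflects the motivation for the split: a single coefficient for $D_X$ cannot express that data switching matters in one mode and is irrelevant in another.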
We can then start from Equation (5) to develop a general model for
semi-autonomous macros. We need two forms of generalization: First, we need to
extend the model to the case of general sensitivity functions; second, we need
to extend the model to handle cases where different sensitivity functions are
associated with disjoint subsets of data inputs.

We partition the set of data inputs X[Ni-1,0] into K disjoint subsets
$X_1, ..., X_K$. All data inputs in the same subset (say, $X_j$) have the same
sensitivity function $F_{X_j}$, i.e., they affect the behavior of the macro in
the same operating modes. In the most general case, each data input has a
different sensitivity function ($K = N_i$); in most cases of practical
interest, however, two subsets are sufficient. The generalized power model we
propose has the following form:

$P_{CND} = c_0 + \sum_{j=1}^{K} \left( c_{(j,F_{X_j})} D_{X_j} P_{F_{X_j}} + c_{(j,F'_{X_j})} D_{X_j} (1 - P_{F_{X_j}}) \right) + c_{K+1} D_C + c_{K+2} D_Y$    (6)

where $P_{F_{X_j}}$ is the probability of the j-th sensitivity function. Notice
that Equation (6) is a family of ad-hoc models, which may have different
numbers of terms, different input subsets and different sensitivity functions
depending on the macro. Nevertheless, the power models can be automatically
constructed and characterized starting from the functional specification of the
macro.
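
Because Equation (6) is linear in its coefficients, one estimate is a dot product between a coefficient vector and a row of regressors. The sketch below shows one possible way to build that row; the function names and the ordering of the entries are assumptions made for this illustration.

```python
def cnd_regressors(subsets, d_c, d_y):
    """Regressor row for Equation (6).

    `subsets` holds one (d_xj, p_fxj) pair per data-input subset X_j: the
    transition density of the subset and the probability of its sensitivity
    function. The resulting row has 1 + 2K + 2 entries.
    """
    row = [1.0]                               # multiplies c_0
    for d_xj, p_fxj in subsets:
        row.append(d_xj * p_fxj)              # subset propagates (F_Xj = 1)
        row.append(d_xj * (1.0 - p_fxj))      # subset is blocked (F_Xj = 0)
    row.extend([d_c, d_y])                    # control and output densities
    return row

def cnd_power(coeffs, row):
    """Power estimate as the dot product of coefficients and regressors."""
    if len(coeffs) != len(row):
        raise ValueError("coefficient and regressor counts differ")
    return sum(c * r for c, r in zip(coeffs, row))

# Two subsets (K = 2): the second one is always sensitive (probability 1).
example_row = cnd_regressors([(0.4, 0.25), (0.2, 1.0)], d_c=0.1, d_y=0.3)
```

Building the row automatically from the list of subsets is what keeps the construction mechanical even though the number of terms changes from macro to macro.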

4 Experimental Results 

We applied the proposed power model to instances of all the sequential soft 
macros in the Synopsys’ DesignWare library that meet the definition of semi- 
autonomous macros. Each macro was mapped onto a reference technology li- 
brary characterized for power and simulated by means of Synopsys VSS with 
DesignPower to obtain reference power values to be used for characterization 
and validation. Estimation results are collected in Table 1. 

For each macro, sensitivity functions (and subsets) were directly obtained 
from the functional specification. Each model was characterized using the re- 
sults of a large set of simulation experiments, sampling different input statistics 
and different operating modes. In particular, for a macro with K sensitivity 
functions, 2K + 1 sets of experiments were used. Each set of experiments con- 
sists of 10 simulations of 50 patterns each, and was conceived to exercise a given 
operating mode (characterized by a fixed value of a given sensitivity function) 
under different data statistics. To this purpose, input streams were generated 
by assigning fixed values to the control inputs appearing in a given sensitivity 
function, and changing the remaining (data and control) inputs according to the 
given input statistics. 

For each experiment, input/output transition densities, control signal proba- 
bilities, and the probability of all sensitivity functions were computed and stored 
in a row of a characterization matrix. Finally, the power model was automati- 
cally built and characterized to fit the data in the matrix. Black-box and linear 
models were also characterized for comparison. 
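
The fitting step can be sketched as an ordinary least-squares problem over the characterization matrix. The source does not state which regression procedure was used, so the normal-equation solver below is an assumption, and the sample data and coefficient values are synthetic.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: experiments are not diverse enough")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_least_squares(rows, powers):
    """Coefficients c minimizing the sum of squared errors (rows . c - power)."""
    n = len(rows[0])
    # Normal equations: (A^T A) c = A^T p
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atp = [sum(r[i] * p for r, p in zip(rows, powers)) for i in range(n)]
    return solve_linear_system(ata, atp)

# Synthetic experiments: (D_X, P_F, D_C, D_Y), one tuple per simulation run.
samples = [(0.1, 0.0, 0.2, 0.3), (0.5, 1.0, 0.1, 0.4), (0.3, 0.5, 0.4, 0.1),
           (0.7, 0.2, 0.3, 0.6), (0.9, 0.8, 0.5, 0.2), (0.2, 0.6, 0.6, 0.5),
           (0.6, 0.3, 0.05, 0.25)]
rows = [[1.0, dx * p, dx * (1.0 - p), dc, dy] for dx, p, dc, dy in samples]
true_coeffs = [10.0, 5.0, 1.0, 2.0, 3.0]             # hypothetical values
powers = [sum(c * r for c, r in zip(true_coeffs, row)) for row in rows]
fitted = fit_least_squares(rows, powers)
```

Running separate experiments with the sensitivity function fixed to 0 and to 1 matters here: without both kinds of rows, the two data columns become linearly dependent and the system cannot be solved.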
Table 1. Experimental Results for DesignWare Semi-Autonomous Macros
(AvgErr and StdDev in %).

                                 P_BB           P_LIN           P_CND
Macro              Stream    AvgErr  StdDev  AvgErr  StdDev  AvgErr  StdDev

DW03_shift_reg     Char        8.68    7.36    7.72    6.78    6.16    4.80
                   Rand        9.14    4.79    4.59    3.03    3.07    2.67
                   Ld=0       20.81    2.79   19.38    2.83   10.99    5.73
                   Ld=1        3.73    2.97    3.43    2.52    5.29    3.37
                   Shift=0     4.77    3.79    4.86    3.15    4.83    3.55
                   Shift=1     5.04    3.29    5.81    3.80    5.46    3.81
                   Average     8.70    4.17    7.63    3.69    5.97    3.99

DW03_lfsr_dcnto    Char        4.37    3.12    3.75    3.15    3.59    3.01
                   Rand        4.46    2.76    3.68    3.13    3.86    2.96
                   Ld=0        3.74    3.44    3.87    3.30    3.45    2.89
                   Ld=1        4.91    3.34    3.69    3.05    3.46    3.22
                   Cen=0      33.75   12.41   31.40   11.74   28.57   13.28
                   Cen=1      11.17    2.60   10.00    2.19   11.39    1.94
                   Average    10.40    4.61    9.40    4.43    9.05    4.55

DW03_bictr_dcnto   Char        6.18    5.48    5.07    4.69    3.56    3.37
                   Rand        5.72    3.80    4.23    2.63    3.05    2.11
                   Ld=0        6.02    4.01    4.36    3.62    3.34    3.04
                   Ld=1        6.79    7.75    6.62    6.58    4.29    4.47
                   Cen=0       5.19    5.32    5.25    4.80    4.60    3.72
                   Cen=1       8.53    4.40   10.05    3.83    7.94    3.46
                   Average     6.41    5.13    5.93    4.36    4.46    3.36

DW03_bictr_decode  Char        8.64    7.51    8.23    7.57    6.87    6.99
                   Rand        8.91    5.58    7.27    6.09    6.46    5.71
                   Ld=0       11.41   10.27   10.76   10.61    9.87    9.18
                   Ld=1        5.58    4.14    6.66    3.86    4.28    3.91
                   Cen=0       8.95    5.84   10.25    5.51    9.43    5.36
                   Cen=1      10.03    4.30   12.12    4.45   10.33    4.29
                   Average     8.92    6.27    9.22    6.35    7.87    5.91

DW03_lfsr_scnto    Char        5.50    3.86    4.61    3.58    4.53    3.56
                   Rand        5.56    3.58    4.52    3.12    4.66    3.19
                   Ld=0        4.81    3.77    4.68    3.59    4.12    3.21
                   Ld=1        6.13    4.16    4.64    4.05    4.80    4.22
                   Cen=0      60.51   28.66   53.67   26.51   50.17   27.22
                   Cen=1      14.73    4.52   13.55    3.82   14.79    3.90
                   Average    16.21    8.09   14.28    7.45   13.85    7.55

Similar experiments have been performed for evaluation, but with input
statistics different from those of the characterization phase, so that the
out-of-sample accuracy has also been evaluated. Evaluation streams have been
synthesized so as to represent realistic situations as much as possible, that
is, with a meaningful alternation of different operational modes. Accuracy has
been measured in terms of average error and standard deviation, defined as:

$AvgErr(\%) = \frac{1}{n} \sum_{i=1}^{n} \frac{|P_{est}(i) - P_{real}(i)|}{P_{real}(i)} \cdot 100$    (7)

$StdDev(\%) = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \frac{|P_{est}(i) - P_{real}(i)|}{P_{real}(i)} \right)^2 - \left( \frac{AvgErr}{100} \right)^2 } \cdot 100$    (8)
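
The two accuracy measures can be computed from paired lists of estimated and reference power values, as in the sketch below. It assumes that the standard deviation is taken over the n relative errors with a divisor of n (population form); the source does not state the divisor, and the function names are chosen for this example.

```python
from math import sqrt

def relative_errors(estimated, real):
    """Absolute relative error of each estimate with respect to its reference."""
    return [abs(e - r) / r for e, r in zip(estimated, real)]

def avg_err_percent(estimated, real):
    """Mean relative error, expressed in percent."""
    errs = relative_errors(estimated, real)
    return 100.0 * sum(errs) / len(errs)

def std_dev_percent(estimated, real):
    """Standard deviation of the relative errors, expressed in percent."""
    errs = relative_errors(estimated, real)
    mean = sum(errs) / len(errs)
    variance = sum((e - mean) ** 2 for e in errs) / len(errs)
    return 100.0 * sqrt(variance)

# Three experiments with relative errors of 10 %, 10 % and 0 %.
avg = avg_err_percent([110.0, 90.0, 100.0], [100.0, 100.0, 100.0])
std = std_dev_percent([110.0, 90.0, 100.0], [100.0, 100.0, 100.0])
```

Reporting both numbers is useful because a model with a low average error can still be unreliable if the spread of its errors across experiments is large.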



For each macro, we have reported the estimation error and standard deviation
for various streams, each one corresponding to a different statistical profile.
Stream Char refers to a stream with properties similar to the one used for the
characterization. This row therefore reports the in-sample error.

Stream Rand is constructed without separately exercising control and data
inputs, and by applying uniform white noise to all the input bits. This stream
represents the worst case for our model, since the advantage of exposing
control variables is lost.

The other streams clearly depend on the specific macro. Most have the form
input_name = value, denoting the fact that the stream has been built with that
input signal stuck at that particular value.

Results show that the CND model yields the lowest average error of the three
models in the per-macro averages of Table 1, with a standard deviation that is
comparable to or lower than that of the other two models. It is important to
emphasize that, although the improvements appear to be limited, data and
control inputs are not correlated in the streams we have considered for testing
the model. In fact, even test streams with a fixed input do not actually have
an effect on the switching of the data inputs. We can thus claim that the
evaluation conditions used for Table 1 represent the worst-case improvement in
accuracy.

To further observe where the proposed model improves over the others, we
analyze the results for a specific macro, namely a Linear Feedback Shift
Register with parallel load (corresponding to the DesignWare macro
DW03_lfsr_load).

We compare a conventional linear regression model like the one of Equation 3
with a model based on the derivation of Equation 6:

$P_{LIN} = c_0 D_{clk} + c_1 D_{reset} + c_2 D_{load} + c_3 D_{cen} + c_4 D_{count} + c_5 D_{data}$

$P_{CND} = c_0 D_{clk} + c_1 D_{reset} + c_2 D_{load} + c_3 D_{cen} + c_4 D_{count} + c_5 D_{data} P_{load} + c_6 D_{data} (1 - P_{load})$

After characterization, the two models are extracted, yielding the following
regression coefficients:

Coefficient     LIN     CND
c_0            1685    1651
c_1            2448    2404
c_2            3302    3369
c_3            3068    2902
c_4            5733    6207
c_5            1999    1286
c_6               -    2936



From inspection of the models, we observe that the main difference between
$P_{LIN}$ and $P_{CND}$ is the way the dependency between power and the input
data switching $D_{data}$ is modeled. In $P_{LIN}$, $D_{data}$ is considered as
an independent variable (and thus depends on a single coefficient), whereas in
$P_{CND}$ the joint effect of $D_{data}$ and the control signal load is
considered. This amounts to splitting the contribution $c_5 \cdot D_{data}$ of
$P_{LIN}$ in two parts, depending on the value of load.

This is reflected in the values of the coefficients. While $c_0, ..., c_4$,
which refer to the non-controlled inputs and outputs, have similar values in
both models, coefficient $c_5$ of the LIN model (1999) is close to the average
of $c_5$ and $c_6$ in the CND model ((1286 + 2936)/2 = 2111).
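
The relation between the two coefficient sets can be reproduced numerically. The sketch below uses the regression coefficients reported above; the function names and the input densities passed to them are assumptions for illustration only.

```python
# Regression coefficients reported above for the LFSR with parallel load.
LIN = [1685, 2448, 3302, 3068, 5733, 1999]          # c_0 .. c_5
CND = [1651, 2404, 3369, 2902, 6207, 1286, 2936]    # c_0 .. c_6

def p_lin(d, coeffs=LIN):
    """d = (D_clk, D_reset, D_load, D_cen, D_count, D_data)."""
    return sum(c * x for c, x in zip(coeffs, d))

def p_cnd(d, p_load, coeffs=CND):
    """Same densities as p_lin, plus the signal probability of load."""
    shared = sum(c * x for c, x in zip(coeffs[:5], d[:5]))
    d_data = d[5]
    return (shared
            + coeffs[5] * d_data * p_load
            + coeffs[6] * d_data * (1 - p_load))

def effective_data_coefficient(p_load, coeffs=CND):
    """Weight that the CND model applies to D_data for a given P_load."""
    return coeffs[5] * p_load + coeffs[6] * (1 - p_load)

# With load asserted half of the time, the CND weight on D_data is the plain
# average of c_5 and c_6, which can be compared with c_5 of the LIN model.
balanced = effective_data_coefficient(0.5)
gap = abs(balanced - LIN[5]) / LIN[5]
```

For other values of $P_{load}$ the effective weight moves between 1286 and 2936, which is exactly the dependence that the single LIN coefficient cannot represent.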

5 Conclusions 

We have proposed a new power macromodel for control-dominated RTL macros.
The control inputs may activate input-insensitive behaviors that let the macro
evolve in a semi-autonomous way.

The proposed model overcomes the limitations of conventional black-box,
activity-sensitive power models, because it explicitly represents the
correlation between some of the control and data inputs by adopting a
higher-order model.

The model, although macro-specific, can be automatically generated because
it only requires the specification of which control signals affect a set of
data inputs.

Results are promising, and show better accuracy than conventional models,
even for streams that do not enforce the existing correlation between control
and data signals.

Abstract. An approach for power modelling of parameterized, technology
independent design components (firm-macros) is presented. Executable
simulation models in the form of C++ classes are generated by a systematic
procedure that is based on statistical modelling and table look-up techniques.
In contrast to other table look-up based approaches, the proposed model handles
the inputs of a component separately, which makes it possible to model the
effects of the corresponding joint-dependencies. In addition, a technique for
the generation of executable models is presented. The generated models are
optimized with respect to simulation performance and can be applied for power
analysis and optimization tasks on the behavioral and architectural level.
Results are presented for a number of test cases which show the good quality of
the model.



1 Introduction 

Recent years have brought an enormous increase in the integration of circuit
elements on a single chip. This trend towards higher performance and smaller
device sizes comes with enormous physical challenges. One of these challenges
is power dissipation. High power consumption means high power costs and short
battery lifetime in mobile applications. Consequently, power dissipation is an
important part of the cost function of modern designs, and tools that make it
possible to analyze the power consumption at high levels of abstraction are in
high demand.

Meanwhile, techniques for power analysis and low power synthesis on the
behavioral level have come up [1,2,3,4]. Given a behavioral description of an
algorithm, these techniques allow an efficient estimation of upper and lower
bounds of the power consumption and even suggest power-optimal allocation and
binding of datapath components. Usually, these components are combinational
arithmetic and logic units which are provided as so-called 'firm-macros' (VSI
Alliance recommendation) in a component library (e.g. the DesignWare® library
from Synopsys®). These firm-macros have a defined module architecture and are
parametric in terms of the word-length. They are provided as
technology-independent descriptions which can be mapped onto a specific
technology by logic synthesis. To guide analyses and optimizations, these
estimation and optimization techniques require power models for the datapath
components which describe the dependency of the power consumption on
significant macro parameters and support typical optimization steps.



1 This work is funded by the BMBF project EURIPIDES under grant number 01M3036G and by the Commission
of the European Community as part of the ESPRIT IV programme under contract no. 26796.



D. Soudris, P. Pirsch, and E. Barke (Eds.): PATMOS 2000, LNCS 1918, pp. 24-35, 2000. 

A number of techniques for power modelling of combinational and sequential
design components have already been proposed. A good overview of existing power
modelling techniques is given in [5]. Unfortunately, most of these techniques
focus on power modelling of so-called 'hard-macros'. The structure of these
components is fixed and a low-level implementation on gate or transistor level
is available. Only a few of these techniques can in principle be extended to
handle datapath components, because this requires word-length independent model
parameters and variables. For one of these techniques such an extension has
been presented in [5].

A power model dedicated to parameterized datapath components was presented by 
Landman [11]. The model is derived under the assumption of certain signal statistics. 
Unfortunately, statistics of real application data may significantly differ from these 
assumptions, especially in the case of resource sharing. In Section 3 we will further com- 
ment on this approach and on the limitations. 

In the following, we propose a new approach for power modelling of firm macros
and suggest a corresponding technique for automatic model generation. The
generated models describe the dependency of the power consumption on
significant input characteristics and macro parameters. Different module inputs
are regarded separately. In this way, the influence of input-data
joint-dependencies and of the mapping of the data streams onto module inputs is
taken into account. The dependency on the module word-lengths is handled by a
regression technique that uses architecture information to minimize the number
of prototypes which are necessary to fit the model to a specific technology.
Furthermore, a technique for a systematic generation and integration of
executable simulation models is suggested. Evaluation results for a number of
test cases are presented which demonstrate the good quality of the model.

The rest of this article is structured as follows. In Section 2 we start with a definition 
and separation of the modelling problem. Section 3 describes our concept for statistical 
modelling and presents the approach for modelling the data-dependency. In Section 4 we 
focus on modelling the word-length dependency and explain the handling of control-inputs 
and logic-optimizations. Section 5 explains our technique for generating and integrating 
the models into a behavioral power analysis and optimization tool. This is followed by an 
explanation of our evaluation process and the presentation of results. The paper concludes 
with a brief summary in Section 6. 



2 Problem Definition and Separation 

The problem of modelling the power consumption of combinational firm macros is
the problem of identifying a functional relationship between the power
consumption P and 1) significant characteristics of two consecutive input
vectors G(X[n-1], X[n]), 2) the vector of input word-lengths BW of a component
instance, 3) the architecture A of a component instance, and 4) the mapping
technology T:

$P[n] = f(G(X[n-1], X[n]), BW, A, T)$    (1)

This problem can be separated into smaller sub-problems. Without loss of
generality, separate models can be used for different architectures and
technologies, while the technique used to generate the model is the same. With
this, A and T disappear from the equation. By carefully choosing a set of model
variables, the dependencies of P on G(X[n-1], X[n]) and on BW become
approximately statistically independent (we will show this in Section 4), so
that the problem reduces to the separate issues of 1) modelling the dependency
of the power consumption on the input vector characteristics G(X[n-1], X[n])
and 2) modelling the dependency on the component word-lengths:

$P[n] = f(G(X[n-1], X[n])) \cdot h(BW)$    (2)

A wide variety of statistical modelling techniques exist that help to
systematically ascertain such relationships (for an overview see [15,16]). The
quality of this functional relationship can be assessed by suitable statistical
measures, e.g. the mean square error, which describe the discrepancy between
the true power values $P[n]$ and the estimated values $\hat{P}[n]$.



3 Modelling Data Dependencies 

In this section we describe our approach for systematically deriving a
statistical model that describes the data dependency of the power consumption.
The process of model derivation that has been used can be separated into the
steps of model identification and model fitting. Model identification is the
process of finding or choosing an appropriate functional relationship for the
given situation. Model fitting is the stage of moving from a general form to a
numerical form. Because of the limited space of the paper we cannot go into the
details of all modelling steps, but instead present some basic methodologies
and the underlying statistical techniques that are used.

3.1 Model Identification 

The process of model identification contains the steps of data identification, model param- 
eter selection and the identification of a functional relationship. Data identification is the 
process of analyzing the input data with the aim of deriving hints for the selection of model 
variables, parameters and forms of the functional relationship. 

Data Identification and Model Parameter Selection 

On high levels of abstraction, stationary signals are represented as abstract
values. Characteristics of the signals are mean μ, variance σ², and temporal
correlation ρ [8]. On lower levels, statistics of bits and bit-vectors are of
interest. Bit characteristics are signal probability p, switching activity t
and temporal correlations. Characteristics of bit-vector streams are the
Hamming distance Hd, average values of switching activity and signal
probabilities, as well as measures of spatio-temporal correlations [9,10]. In
addition to data models, techniques for empirical and analytical estimation of
bit-level statistics from word-level statistics exist [11,12]. These models are
usually restricted to simple Gauss distributions or simple AR models. Assuming
such input streams leads to bit-level statistics with some typical
characteristics, which can be used to derive a power model. Landman, with his
dual-bit-type model, was the first to use this technique consistently to
develop a power model for datapath components [11].

The disadvantage of this methodology is that it restricts the power model to applications
where the assumption on the distribution can be guaranteed. Unfortunately, this assumption
does not hold for a number of real applications (e.g. [17]), especially in the case of
resource sharing, as different input streams are mixed here.

So as not to restrict the application, we have only considered model variables and
parameters during the selection process that do not require any high-level data models.
Instead we consider bit-level statistics which have a high significance with respect to the
module's power consumption and which can be captured efficiently during a functional
simulation of the design. Nevertheless, the values of our model variables can be estimated
analytically from high-level statistics for a number of typical cases, e.g. Gaussian or
Laplace distributions. Hence, our modelling approach can also be used for fast probabilistic
simulation techniques on RT-level, where high-level statistics are propagated through the
design. The methodology for calculating our model parameters from high-level statistics has
been presented in [13].

For selecting our model variables and parameters we used a mixture of empirical and 
analytical techniques. Different sets of model variables have systematically been chosen 
and have been evaluated for typical datapath components using statistical model selection 
techniques and significance analyses [15]. Because of the limited space we will not go into 
further detail, but focus on the result of this step. 

To characterize a sequence of two consecutive bit-vectors X[n-1], X[n] at one module
input, we use the Hamming-distance Hd and the number of digits with a fixed value equal
to zero (#0) or one (#1). Consequently, a transition T of two consecutive bit-vectors is
characterized by:

T[n] = (Hd[n], \#0[n], \#1[n])    (3)

with

Hd[n] = Hd(X[n-1], X[n]) = |\{ i \mid X[n-1]_i \neq X[n]_i \}|

\#0[n] = |\{ i \mid X[n-1]_i = X[n]_i = 0 \}|

\#1[n] = |\{ i \mid X[n-1]_i = X[n]_i = 1 \}|

for 1 \le i \le m and m as the vector word-length.
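
As an illustration of equation (3), the following minimal Python sketch computes the three
transition characteristics for two consecutive bit-vectors. It assumes that the vectors are
given as non-negative integers; the function name `characterize_transition` is hypothetical
and not taken from the paper.

```python
def characterize_transition(prev, curr, m):
    """Return (Hd, n0, n1) for two consecutive m-bit vectors given as ints."""
    mask = (1 << m) - 1
    prev &= mask
    curr &= mask
    hd = bin(prev ^ curr).count("1")   # bits that differ between the vectors
    n1 = bin(prev & curr).count("1")   # bits that stay at 1
    n0 = m - hd - n1                   # remaining bits stay at 0
    return hd, n0, n1


# Example: X[n-1] = 1100, X[n] = 1010 with word-length m = 4
print(characterize_transition(0b1100, 0b1010, 4))  # (2, 1, 1)
```

The last line of the function uses the fact that every bit position either switches, stays
at one or stays at zero, so the three counts always add up to the word-length m.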

As the word-length of an input is fixed for an instantiated component
(m = Hd + #0 + #1 = const), only two of the three variables are independent. Furthermore,
we normalize these values to the word-length m to obtain a value range that is independent
of the word-length, so that we choose the normalized values \overline{Hd} = Hd/m and
\overline{\#0} = \#0/m as model variables. In general we refrain from using module output
statistics as model parameters (except for multiplexers), as it is difficult to generate
output values with a specific combination of statistics, which is necessary during the
model fitting.

The significance of the chosen parameters is exemplified in Figure 1. The figure
shows the average charge consumption per transition over the Hamming-distances at the
multiplier inputs A and B. Two settings of the non-switching bits are distinguished. Points
on a line have a constant Hd_{sum} = Hd_A + Hd_B, so that only the distribution of the
total value Hd_{sum} onto the inputs varies.

From the figure it is clear that the charge consumption strongly depends on the
Hamming-distance and on its distribution onto the module inputs (values differ by a factor
of up to 3 for constant values of Hd_{sum}). Furthermore, it can be seen that the power
consumption for a certain input stream at one input strongly depends on the input data of
the other input. Importantly, these influences can only be handled by regarding the data at
the module inputs separately. Furthermore, the significance of the signal values of the
non-switching bits is of interest. The figure contains the average charge consumption per
transition for different combinations of Hd's at the inputs for the cases that all
non-switching bits are '0' and '1', respectively. It can be seen that the charge consumption
differs by a factor of 4 to 10 for corresponding points.




Fig. 1: Average power consumption of a 24x24 bit CSA-multiplier (a) and Booth-Coded Wallace- 
Tree multiplier (b) over the normalized Hamming-distances at inputs A and B for all non-switching 
bits ’0’ and ’1’, respectively 

Summing up, the chosen variables allow a good distinction of transitions in terms of power
for a wide variety of design components and architectures. The variable values can be
captured efficiently from the signal values by a table look-up technique. To calculate a
variable value for two consecutive input vectors, only an XOR operation and an array access
are required, i.e. no time-consuming loops or if-statements are necessary.
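
The table look-up idea can be sketched in Python as follows. The table size (all 8-bit
values), the names `POPCOUNT8`, `hamming_distance_8` and `hamming_distance`, and the bytewise
handling of wider words are assumptions made for illustration; the paper does not specify
how its tables are organized.

```python
# Number of set bits for every 8-bit value, computed once at start-up.
POPCOUNT8 = [bin(v).count("1") for v in range(256)]


def hamming_distance_8(prev, curr):
    """Hamming distance of two 8-bit vectors: one XOR and one array access."""
    return POPCOUNT8[(prev ^ curr) & 0xFF]


def hamming_distance(prev, curr, m):
    """Wider words: one table access per byte of the XOR result (assumed scheme)."""
    x = (prev ^ curr) & ((1 << m) - 1)
    return sum(POPCOUNT8[(x >> shift) & 0xFF] for shift in range(0, m, 8))


print(hamming_distance_8(0b11110000, 0b00001111))  # 8
print(hamming_distance(0xFFFF, 0x00FF, 16))        # 8
```

For word-lengths up to the table width the evaluation is a single XOR and a single array
access; wider words need one access per byte, which is a design choice of this sketch.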

Identification of Functional Relationships 

As it is difficult to infer a functional relationship that holds for the complete value
range, we decided to use an interpolation technique for localized approximations. For
performance reasons, we apply a multi-dimensional linear interpolation technique. Values
between neighboring grid-points are approximated by first-order Taylor series f. The
differential coefficients of the Taylor series are approximated by difference coefficients,
which are calculated from the function values p of the nearest grid-points [18].

This technique allows a fast calculation for a given set of variable values, as it only
requires selecting and evaluating the corresponding function f. The selection of an adequate
grid-size and the calculation of the approximation functions are explained in Section 3.2.
As an alternative to the presented interpolation technique, we suggest the technique
presented in [20]. This technique has a higher flexibility and smaller memory demands, but
leads to higher computational cost.
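
A minimal sketch of such a localized first-order approximation is given below for two model
variables. It assumes a square grid over [0, 1] x [0, 1] with spacing `delta` and uses
forward differences taken from the lower-left grid point of the enclosing cell; the exact
difference scheme of [18] may differ, and the function name `interpolate` is hypothetical.

```python
def interpolate(grid, delta, x, y):
    """First-order approximation of the value at (x, y), with 0 <= x, y <= 1.

    grid[i][j] holds the characterized value at the point (i * delta, j * delta).
    """
    n = len(grid) - 1                  # number of grid intervals per axis
    i = min(int(x / delta), n - 1)     # lower-left grid point of the cell
    j = min(int(y / delta), n - 1)
    p0 = grid[i][j]
    # Difference coefficients stand in for the partial derivatives.
    dp_dx = (grid[i + 1][j] - p0) / delta
    dp_dy = (grid[i][j + 1] - p0) / delta
    return p0 + dp_dx * (x - i * delta) + dp_dy * (y - j * delta)


# A linear test surface is reproduced exactly by a first-order scheme.
delta = 0.25
grid = [[2 * i * delta + 3 * j * delta + 1 for j in range(5)] for i in range(5)]
print(interpolate(grid, delta, 0.375, 0.625))  # 3.625
```

Only one cell selection and a handful of multiplications are needed per evaluation, which
is in line with the performance argument made in the text.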

3.2 Model Fitting 

Model fitting is the process of moving to a numerical form. For our approach this includes
the definition of the grid-size and the determination of the interpolation functions f. For
the interpolation functions it is necessary to estimate the function values at the grid
points and the difference coefficients.

Grid-Size Identification

The grid-size can be determined iteratively by analyzing appropriate error measures.
We use an empirical technique that considers the n samples located in the geometric centre
of neighboring grid-points (intermediate points). The 'true' values at these points (taken
from a simulation of a component prototype) and the values calculated by the interpolation
functions are compared by the mean square error:

MSE = \frac{1}{n} \sum_{i=1}^{n} (P[i] - \hat{P}[i])^2    (4)

with:

P[i] : values from simulation,

\hat{P}[i] : values from interpolation.

The procedure for grid-size estimation is then as follows:

1) define an error limit MSE_{limit},

2) set the grid-size \Delta to an initial value,

3) estimate 'true' function values for the grid-points and calculate the interpolation
functions f_{i,j},

4) estimate 'true' function values for the intermediate points,

5) calculate the MSE,

6) if MSE > MSE_{limit}, set \Delta = \Delta/2 and repeat 3) to 6); otherwise stop the
procedure.

Instead of globally reducing the grid-size, it is also possible to reduce it locally if a
more detailed data analysis is performed. Furthermore, the grid-size must comply with the
word-lengths of the component prototype, e.g. for a 16 x 16 bit component only multiples of
1/16 are possible.
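
The refinement loop can be sketched as follows. The callable `true_value` stands in for a
power simulation of the component prototype, the starting value of 0.5 and the lower bound
`min_delta` (reflecting the word-length restriction) are assumptions, and the interpolation
at the cell centres uses simple forward differences rather than the exact scheme of the
paper.

```python
def refine_grid_size(true_value, mse_limit, delta=0.5, min_delta=1 / 16):
    """Halve delta until the MSE at the cell centres is within mse_limit."""
    while True:
        n = round(1 / delta)
        # Step 3: characterize all grid points.
        grid = [[true_value(i * delta, j * delta) for j in range(n + 1)]
                for i in range(n + 1)]
        # Steps 4 and 5: compare 'true' and interpolated values at cell centres.
        squared_errors = []
        for i in range(n):
            for j in range(n):
                p0 = grid[i][j]
                estimate = (p0 + 0.5 * (grid[i + 1][j] - p0)
                               + 0.5 * (grid[i][j + 1] - p0))
                centre = true_value((i + 0.5) * delta, (j + 0.5) * delta)
                squared_errors.append((centre - estimate) ** 2)
        mse = sum(squared_errors) / len(squared_errors)
        # Step 6: stop, or halve the grid-size and repeat.
        if mse <= mse_limit or delta / 2 < min_delta:
            return delta, mse
        delta /= 2


delta, mse = refine_grid_size(lambda x, y: x * x + y * y, mse_limit=1e-3)
print(delta, mse)  # 0.25 0.0009765625
```

If the error limit cannot be met before `min_delta` is reached, the sketch returns the last
grid-size together with its MSE, so the caller can decide how to proceed.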

For all components we have analyzed so far (see Section 5) the interpolation technique
works very well, and a step-size of \Delta = 0.25 was sufficient to achieve average
interpolation errors of less than 5-10%. As an example, Figure 2 illustrates the quality of
the interpolation technique for a 16x16-bit carry-save-array multiplier.

Fig. 2: a) Comparison of values at intermediate grid-points; b) Comparison of estimated and 'true'
intermediate grid-point values



Model-Parameter Estimation 

The process for parameter estimation is as follows:

1) generate a stream of input patterns for each grid-point, i.e. a stream where consecutive
input vectors have a defined characteristic,

2) perform a power simulation of a component prototype for each pattern stream,

3) use the average charge consumption per transition as the estimate of the 'true' model
parameter (in the sense of the method of least squares, the average is the best estimate),

4) calculate error margins based on the central limit theorem,

5) if the error margins are not met, extend the pattern stream by a number of new vectors,
which can be estimated from the central limit theorem [19].

Note that steps 4) and 5) can only be executed if a cycle-accurate power
simulator is used.
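
Steps 4) and 5) can be illustrated with the usual normal approximation of the sample mean.
The sketch below is not the authors' implementation: the confidence factor z = 1.96
(roughly 95% confidence), the 5% target and the function names are assumptions, and
`samples` stands for the per-transition charge values reported by a cycle-accurate
simulator.

```python
import math
import statistics


def error_margin(samples, z=1.96):
    """Relative half-width of the confidence interval of the sample mean."""
    mean = statistics.fmean(samples)
    std_error = statistics.stdev(samples) / math.sqrt(len(samples))
    return z * std_error / abs(mean)


def required_patterns(samples, target=0.05, z=1.96):
    """Stream length needed to reach the target relative error margin."""
    mean = statistics.fmean(samples)
    spread = statistics.stdev(samples)
    return math.ceil((z * spread / (target * abs(mean))) ** 2)


# Hypothetical charge values for one grid point (100 transitions).
samples = [1.0, 1.2, 0.8, 1.1, 0.9] * 20
print(error_margin(samples) < 0.05)                 # True
print(required_patterns(samples) <= len(samples))   # True
```

If `required_patterns` returns more vectors than are available, the stream is extended by
the difference and the margin is evaluated again.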

For the components we have characterized, we have found that in 95% of all cases a
vector stream of 100 patterns leads to error margins of less than 5%. Using a fixed set of
200 vectors is therefore sufficient in practice. It is important that the patterns for
fitting our model can be generated as a continuous stream, which can be simulated in one run
(for each grid point). That is, it is not necessary to run the simulator n times to simulate
n vector pairs, which is usually very time consuming. Because of the limited space, we omit
the presentation of the algorithm for the generation of characterization pattern streams.



4 Modelling Word-Length Dependencies 

The process of modelling the word-length dependency also consists of the sub-problems of
model identification and model fitting. In contrast to the model identification used for the
data dependency, which mainly relies on empirical techniques for data analysis, the
procedure for word-length dependency modelling relies more strongly on conceptual
techniques. If available, we extract the form of the functional dependency from the
architecture of the component, i.e. we use the knowledge about the component structure.

4.1 Model Identification 

The problem of describing the dependency on the word-length can be mapped to the problem
of describing the influence of the word-length on the interpolation function or on the
function parameters p (values at grid-points), respectively:

p_{i,j} = k_{i,j}(BW),    (5)

where i, j denotes a certain grid point.

As:

p_{i,j} = \frac{1}{2} V_{dd}^2 \cdot C \cdot \alpha_{i,j}

with:

V_{dd} : supply voltage,

C : module capacitance and

\alpha_{i,j} : an average activity factor, which is constant for a certain grid-point,

it follows that the model parameters p_{i,j} are proportional to the module capacitance,
which is a function of the input word-lengths for a fixed architecture:

p_{i,j} \propto C \propto k_{i,j}(BW)    (6)

As suggested originally in [11], the form of this dependency can be derived from the
architecture of a component. For example, the dependency for a carry-save-array multiplier
with the input word-lengths BW = {bw_A, bw_B} is of the form:

p_{i,j} = k_{i,j}(BW) = r_2 \cdot (bw_A \cdot bw_B) + r_1 \cdot (bw_A + bw_B) + r_0

with R = [r_2  r_1  r_0]^T the vector of function parameters. It is important that these
dependency functions also allow the handling of components with multiple inputs that have
different word-lengths.

Since for small grid-sizes the number of parameters and with this the number of corre- 
sponding functions k might be large, it is necessary to use an additional approximation. 
So, instead of regarding the dependency of each parameter we use an average dependency. 
This is done as follows: 

1) Normalize all parameters:

\bar{p}_{i,j}^{BW} = \frac{p_{i,j}^{BW}}{p_{norm}^{BW}}

with:

p_{i,j}^{BW} : a parameter p_{i,j} for a component prototype with fixed word-length BW,

p_{norm}^{BW} : average of all parameters p_{i,j}^{BW} as norming value.

2) Average the normalized parameters over the word-length:

p_{i,j}^{avg} = \frac{1}{size(C)} \cdot \sum_{BW \in C} \bar{p}_{i,j}^{BW}

with:

C : a set of component prototypes with different word-lengths.

With this the problem reduces to the problem of fitting the function that describes the
dependency of the norming value on the word-length BW:

p_{norm} = k(BW).

Once a numerical form is determined, the parameters p_{i,j}(BW) can be estimated by

p_{i,j}^{approx}(BW) = p_{i,j}^{avg} \cdot p_{norm} = p_{i,j}^{avg} \cdot k(BW).    (7)
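
The normalization, averaging and rescaling steps can be sketched as follows. The data
layout (a dictionary from word-length to a dictionary of grid-point parameters) and the
function names are assumptions chosen for illustration, and the numbers in the example are
invented.

```python
def normalize_and_average(params):
    """params maps a word-length BW to a dict {grid point: p_ij}."""
    p_norm, p_bar = {}, {}
    for bw, p in params.items():
        p_norm[bw] = sum(p.values()) / len(p)                   # norming value
        p_bar[bw] = {g: v / p_norm[bw] for g, v in p.items()}   # step 1
    grid_points = next(iter(params.values())).keys()
    p_avg = {g: sum(p_bar[bw][g] for bw in params) / len(params)
             for g in grid_points}                              # step 2
    return p_norm, p_avg


def approximate(p_avg, k_of_bw):
    """Rescale the averaged parameter shape by the norming value k(BW)."""
    return {g: v * k_of_bw for g, v in p_avg.items()}


# Two hypothetical prototypes whose parameters differ only by a scale factor.
params = {8: {(0, 0): 1.0, (0, 1): 3.0}, 16: {(0, 0): 4.0, (0, 1): 12.0}}
p_norm, p_avg = normalize_and_average(params)
print(p_norm)                          # {8: 2.0, 16: 8.0}
print(approximate(p_avg, p_norm[16]))  # {(0, 0): 4.0, (0, 1): 12.0}
```

In this idealized example the averaged shape reproduces the 16-bit prototype exactly; for
real components the normalized parameters differ slightly between word-lengths, which is
the approximation error discussed in the text.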



The effects of this approximation step can be evaluated by analyzing the differences between
the 'true' and the approximated values, in terms of average deviation or mean square error.
As an example, Figure 3a shows some 'true' normalized parameters and the approximated
values, again for the CSA-multiplier. For a grid-size of \Delta = 0.25 the average
difference of true and approximated parameters,

\frac{1}{size(C)} \cdot \sum_{BW \in C} (p_{i,j}^{BW} - p_{i,j}^{approx}(BW)),    C = {8x8, 16x16, 24x24, 32x32},

is illustrated in Figure 3b. It can be seen that the (relative) deviations are less than 10%
for 90% of all parameters. The large error peaks seen in Figure 3b occur for parameters
which have very small absolute values, so that the impact on the estimation accuracy is
usually small.

Nevertheless, if this cannot be accepted, it is possible to use local approximations
instead of a global one to reduce the deviations.

Fig. 3: a) Comparison of true and approximated values for transition power (CSA-mult.); b)
Deviations of approximated and true parameters, for prototypes with word-lengths
{8x8, 16x16, 24x24, 32x32}



4.2 Model Fitting 

For model fitting we use a regression approach, which calculates the model parameters
based on the method of least squares. This is done as follows:

1) select a set of component prototypes C with different word-lengths,

2) extract the parameters p_{i,j} as described in Section 3,

3) use the approximation technique described in Section 4.1 to calculate the normalized
parameters,

4) define the form of the function k(BW) based on architecture information,

5) start a regression process to determine the parameters R,

6) evaluate the quality of the regression, and adapt the number of prototypes if necessary.

The quality of the regression can be evaluated in terms of correlation coefficients and risk
functions. This shows how well the values used as input to the regression (estimation
values) are approximated by the regression function. The same measure can also be used to
assess the quality for values not used within the regression process (test values).

In our experience, a small number of components is sufficient to fit the functions k(BW).
To fit the function for word-lengths in the range from 8 to 32 bits, four component
prototypes (8, 16, 24 and 32 bits) are sufficient to achieve deviations of less than 5%.
For components with unknown architectures it is nevertheless possible to use eclectic
techniques to select an adequate functional relationship [15,14]. The price for this is an
increase in the number of prototypes needed to achieve a defined accuracy and confidence.
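
For the multiplier form k(BW) = r_2 (bw_A bw_B) + r_1 (bw_A + bw_B) + r_0, the regression
step can be sketched in plain Python as below. Solving the normal equations by
Gauss-Jordan elimination is a choice made here to keep the example self-contained; it is not
claimed to be the solver used by the authors, and the norming values in the example are
synthetic.

```python
def fit_k(prototypes):
    """Least-squares fit of k(BW) = r2*bwA*bwB + r1*(bwA + bwB) + r0.

    prototypes: list of ((bwA, bwB), p_norm) pairs.
    """
    rows = [([a * b, a + b, 1.0], y) for (a, b), y in prototypes]
    # Normal equations (X^T X) r = X^T y as an augmented 3 x 4 matrix.
    m = [[sum(x[i] * x[j] for x, _ in rows) for j in range(3)]
         + [sum(x[i] * y for x, y in rows)] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(3):
        pivot = max(range(c, 3), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(3):
            if r != c:
                factor = m[r][c] / m[c][c]
                m[r] = [u - factor * v for u, v in zip(m[r], m[c])]
    return [m[i][3] / m[i][i] for i in range(3)]   # [r2, r1, r0]


def k(r, bw_a, bw_b):
    """Evaluate the fitted word-length dependency function."""
    return r[0] * bw_a * bw_b + r[1] * (bw_a + bw_b) + r[2]


# Synthetic norming values generated from r2 = 2, r1 = 3, r0 = 5.
data = [((8, 8), 181.0), ((16, 16), 613.0), ((24, 24), 1301.0), ((32, 32), 2245.0)]
r = fit_k(data)
print([round(v, 6) for v in r])   # approximately [2.0, 3.0, 5.0]
print(round(k(r, 12, 12), 3))     # norming value predicted for a 12x12 prototype
```

With four prototypes and three coefficients one degree of freedom remains, so the residual
of the fit gives a first indication of whether more prototypes are needed.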

The handling of control inputs and of the influence of logic optimizations is
straightforward. As most datapath components only have one control input, a separate model
is generated for each setting of this input ('0', '1' or toggling). Logic optimizations are
considered by generating minimum-area and minimum-delay variants of a component and adapting
the norming value and the word-length dependency function.



5 Evaluation 

The proposed technique for power modelling has been realized as an interactive modelling
tool, which is part of the OFFIS behavioral-level power estimation tool-suite ORINOCO®.
The tool generates C++ classes which encapsulate the complete parameterizable model. The
realization in C++ allows a simple and flexible integration into power analysis and
optimization tools. An open interface methodology simplifies the integration of third-party
models for IP components.

So far we have generated power models for a number of components that are relevant for
behavioral VHDL power analysis. For each component, a set of prototypes used for
characterization and a set used for validation were generated. For each prototype the
estimation accuracy has been analyzed for different sets of evaluation data (test cases to
stress the model) and some sets of real application data. Figure 4 gives an overview of the
components and data sets involved in the evaluation procedure. The quality of the model has
been evaluated in terms of the absolute and relative accuracy of the average and
cycle-accurate power estimates. Furthermore, we have compared our approach to the DBT model
and to a technique presented in [7], which allows cycle-accurate estimates for components
with a fixed word-length, and achieved better results for a number of practical designs,
especially where input streams are mixed due to resource sharing.

Fig. 4: Overview of components used within the evaluation procedure



Because of the limited space we only present a subset of the evaluation results. Table 1
presents the estimation errors for a number of different components and test-data sets.
Estimation errors are given for the case that a model is built for a single instance
(instance model) and for the case that a word-length parameterizable model (BW-param.
model) is used. This makes it possible to evaluate the effects of the word-length
parameterization. Estimation errors are presented for randomly generated input streams
(rand.) and for streams whose data characteristics place the corresponding model variable
values on intermediate grid points (i.m.). These intermediate points are equally distributed
over the complete value range, and thus stress the model over a wide variety of possible
input streams. Furthermore, these intermediate points are a worst case for the modelling
approach, as the total distance between intermediate and (characterized) grid points is
maximal, which is crucial for the interpolation. The deviations for this case are given in
the table as the average of the absolute values of the relative errors:

\frac{1}{size(IM)} \cdot \sum_{IM} \left| \frac{q_{model} - q_{ref}}{q_{ref}} \right|

with:

IM : set of all intermediate grid points,
q_{model} : cycle charge consumption estimated by the model,
q_{ref} : cycle charge consumption from logic level simulation,

and as the average relative estimation error (i.m. avg.). From the table it can be seen
that, except for the case of small word-lengths of the csa-multiplier and rpl-divider, the
impact of the word-length dependency modelling is very small. Even for the intermediate grid
points, the estimation accuracy is acceptable and within error bounds of 10 to 15% over the
complete value range (cf. Figure 2b).
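
The two error figures reported for the intermediate points can be computed as in the
following sketch. The function name `im_errors` and the sample values are hypothetical;
`q_model` and `q_ref` are assumed to be equally long lists of cycle charge values for the
intermediate grid points.

```python
def im_errors(q_model, q_ref):
    """Return (mean absolute relative error, mean signed relative error) in %."""
    rel = [(m - r) / r for m, r in zip(q_model, q_ref)]
    mean_abs = 100 * sum(abs(e) for e in rel) / len(rel)
    mean_signed = 100 * sum(rel) / len(rel)
    return mean_abs, mean_signed


mean_abs, mean_signed = im_errors([1.1, 0.9, 2.2], [1.0, 1.0, 2.0])
print(round(mean_abs, 2), round(mean_signed, 2))  # 10.0 3.33
```

The signed average can be much smaller than the absolute one because over- and
underestimates cancel, which explains why both figures are worth reporting.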

Table 1. Estimation errors in % for different components and input patterns compared to
logic level simulation

component   BW      instance model              BW-param. model
                    rand.   i.m.   i.m. avg.    rand.   i.m.   i.m. avg.
mult-Csa    8x8       -1      8       -5         -18     12        9
            16x16     -2      9       -5         -12     16       12
            24x24     -3      9       -6          -1     10       -4
            32x32     -2     11       -5           2      8        2
mult-Bcwt   8x8       -3     12       -2          -6     15        8
            16x16      3     12       -7          -1     14      -11
            24x24     -1     13       -9          -3     15       -9
            32x32      4     13       -9          -5     15    (missing)
(missing)   8x8       -3     12       -4          -7     14      -11
            16x16     -5     12       -1          -5     13    (missing)
            24x24     -3      9       -5          -1     11       -3
            32x32     -3     10       -4          -2     11       -1

Further rows survive only as isolated values: -1; -1; -6, -2, 9; -5.



6 Summary

In this paper we have presented a concept for power modelling of parameterized datapath
components. The approach is based on statistical modelling techniques and allows a separate
handling of module inputs. With this, the model makes it possible to consider the influence
of joint dependencies of the input data. The separate handling of inputs furthermore makes
it possible to model the influence of the mapping of data streams onto inputs, which is
especially of interest for non-symmetric module structures. This information can be used
for commutative operations to optimize the binding in terms of power. The significance of
the proposed model variables and the adequacy of the modelling form have been shown for
several examples.

Furthermore, we proposed a technique for model generation, which produces executable
simulation models as C++ classes that are optimized with respect to simulation efficiency
and flexibility. In combination with table look-up techniques for capturing the model
variables from simulation data, a very efficient and simple integration into high-level
power analysis tools is possible.

The high simulation performance, the parameterizability, the separate consideration of
module inputs and the automatic model generation in the form of executable C++ models make
the model attractive for high-level power estimation and optimization tasks. The results
that have been presented for a number of different test cases show the good quality of the
approach.



Abstract. Power dissipation due to the steering logic, that is, the multiplexer
network and the interconnect, can usually account for a significant
fraction of the total power budget. In this work, we present RTL power
models for these two types of architectural elements. The multiplexer
model leverages existing scalable models, and can be used for special
complex types with re-configurable numbers of data bits and ways. The
interconnect model is obtained by empirically relating capacitance to
circuit area, which is either estimated by means of statistical models or
extracted from back-annotation information available at the gate level.



1 Introduction 

Although several works have addressed the problem of RTL power estimation (see [1] 
for a survey), most have proposed power models for either the datapath modules (in- 
stantiated in the RTL description as a result of behavioral synthesis) or for the control 
logic driving those modules. 

Besides the contribution of such elements, which are explicitly exposed in the RTL
description as either HDL statements (the controller) or synthetic operators (the datapath
modules), the steering logic, that is, the multiplexer network and the interconnect,
can also account for a significant fraction of the total power budget.

In spite of their potential impact, especially for designs with a large amount of shared
resources, only a few works have addressed the problem of estimating the power due
to the steering logic. Different reasons lie behind this limited analysis.

Multiplexers are not usually considered during RTL estimation essentially because,
unlike datapath operators, they are not explicitly instantiated in the specification;
rather, they are generated during high-level synthesis as a result of resource sharing.

Similarly, the impact of the interconnect is usually neglected because it requires
information on the physical implementation of a design. This implies that, unlike datapath
modules, wire capacitances (and thus power) cannot be pre-characterized. Some approaches
have dealt with this problem by leveraging existing statistical models [2,3,4,5]
that relate the length of the interconnect to macroscopic parameters that can be more
easily inferred from a high-level specification [6].

Regardless of its complexity, the impact of steering logic cannot be simply ignored;
it has to be carefully accounted for to achieve absolute power estimates during design
validation, and to make significant comparisons during design exploration. The following
example shows the impact of MUXes and wires on the power dissipation of a design
with different amounts of sharing.

Example 1 The behavioral specification of an elliptic filter (ellipf), taken from [7],
contains 26 additions. We synthesized three alternative RTL implementations of the
filter: ellipf26, with 26 adders and latency 1; ellipf10, with 10 adders and latency
3; ellipf1, with 1 adder and latency 26. Using Synopsys' DesignCompiler, we mapped
the three implementations onto a gate-level library characterized for power and we
performed gate-level simulation and power estimation by means of VSS and DesignPower.
The same clock period (of 50ns) and data stream (of 100 patterns) were used for all
implementations. The energy budgets are reported in the following table:





               Energy                            Percentage
               ellipf26   ellipf10   ellipf1     ellipf26   ellipf10   ellipf1
ADDER            28670      31950      26910       68.61      32.85      11.92
RANDOM LOGIC       505       4185      52390        1.21       4.30      23.20
MUX                  0      23670     110630        0.00      24.33      48.99
WIRES            12610      37470      35880       30.18      38.52      15.89
TOTAL            41785      97275     225810      100.00     100.00     100.00



We notice that: i) the total energy consumption is significantly different; ii) the energy
spent in performing the sums is almost the same (the only difference being due to the
different signal statistics at the inputs of shared resources); iii) wiring and MUXes
may be responsible for more than 50% of the total energy.
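
Observation iii) can be checked directly against the energy figures of Example 1. The short
Python sketch below recomputes the combined share of MUXes and wires for each
implementation; the dictionary layout and the function name `steering_share` are chosen
only for illustration.

```python
# Energy budgets of Example 1, copied from the table.
energy = {
    "ellipf26": {"ADDER": 28670, "RANDOM LOGIC": 505, "MUX": 0, "WIRES": 12610},
    "ellipf10": {"ADDER": 31950, "RANDOM LOGIC": 4185, "MUX": 23670, "WIRES": 37470},
    "ellipf1": {"ADDER": 26910, "RANDOM LOGIC": 52390, "MUX": 110630, "WIRES": 35880},
}


def steering_share(budget):
    """Percentage of the total energy spent in MUXes and wires."""
    total = sum(budget.values())
    return 100 * (budget["MUX"] + budget["WIRES"]) / total


for name, budget in energy.items():
    print(name, round(steering_share(budget), 2))
# ellipf26 30.18
# ellipf10 62.85
# ellipf1 64.88
```

Both implementations with shared adders spend more than 60% of their energy in the steering
logic, while the fully parallel ellipf26 has no MUXes at all.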



In this work, we address the problem of estimating the power contribution of the 
steering logic. We refer to a high-level-synthesis flow that takes a behavioral specifica- 
tion and builds an RTL implementation, based on a given library of functional macros 
(hereafter called RTL library). We consider RTL descriptions that consist of a set of 
(both hard and soft) macros belonging to the RTL library, some sparse logic imple- 
menting the controller, and the steering logic that is used to properly connect and drive 
the datapath modules and the controller itself. 

In the flow, RTL simulation is used to evaluate the power consumption of the pro- 
posed implementation. This is realized in two steps: First, a specific power macromodel 
for each macro belonging to the RTL library is built. These models are meant to ex- 
press a relation between actual power and some higher-level quantities such as input 
statistics. Second, power is obtained by summing the result of a context-dependent 
evaluation of the power models for each macro, plus the contribution of the control 
logic, which is modeled separately, using a different approach. The power macromodels 
for the RTL macros are either pre-characterized, or constructed and characterized on- 
line, during high-level synthesis, whenever a power estimate is required for a not-yet 
characterized macro. In both cases, power characterization requires fast synthesis of the 
macro and mapping on a reference technology library. Both the pre-characterization 
paradigm and the direct link to the synthesis flow are essential for the discussion of 
the models for the steering logic we propose in this paper. 

It is important to note the different nature of the models proposed in this work.
Multiplexers are soft macros that can be specialized by specifying the number of ways,
the bit-width and the encoding used for selection. Any instance can be synthesized
and pre-characterized for power. The model we present is general (it is applicable to
any MUX with any number of ways) and scalable with respect to the bit-width of the
data path (the same model can be scaled to be used for MUXes with different bit-widths
without re-characterization). Wiring power cannot be pre-characterized. Moreover, wiring
is usually unknown at the RTL. We do not propose a new power model for wiring; rather,
we show how existing gate-level wire models can be exploited to obtain accurate estimates
of wiring power while working at the RTL.



2 Wiring 



The power consumed in charging and discharging wiring capacitance is expressed as:

P_{wiring} = \frac{V_{dd}^2}{2T} \sum_i \alpha_i C_i    (1)

where V_{dd} is the supply voltage, T is the clock cycle, \alpha_i and C_i are the
switching activity and the total capacitance of the i-th net, and the sum is extended to
all RTL nets (i.e., to all nets connecting RTL modules). We assume T and V_{dd} are
specified by the designer, while \alpha_i is computed, for each net, during RTL simulation.
Hence, the task of modeling wiring power reduces to that of modeling wiring capacitance.
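
As a numerical illustration, the sketch below evaluates a wiring power expression of the
form V_dd^2 / (2T) times the sum of alpha_i C_i. The factor 1/2 assumes that the switching
activity counts transitions per clock cycle; this convention, the function name
`wiring_power` and the sample nets are assumptions and are not taken from the paper.

```python
def wiring_power(nets, vdd, t_clk):
    """Wiring power in watts.

    nets: iterable of (switching activity per cycle, capacitance in farads).
    The factor 1/2 is an assumed convention (activity = transitions per cycle).
    """
    return vdd ** 2 / (2 * t_clk) * sum(a * c for a, c in nets)


# Two hypothetical nets, a 2 V supply and the 50 ns clock period of Example 1.
nets = [(0.5, 1e-12), (0.25, 2e-12)]
print(wiring_power(nets, vdd=2.0, t_clk=50e-9))  # about 4e-05 W
```

Since the supply voltage and the clock period are fixed by the designer and the activities
come from RTL simulation, the capacitances are the only remaining unknowns.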

The parasitic capacitance associated with the interconnection between two or more
modules is the sum of many contributions: the output capacitance of the driving
component C_{out}, the input capacitance of all driven components C_{in} and the actual
wiring capacitance C_{wire}. In general, however, the power contribution of C_{out} is
implicitly modeled by the power model of the driving macro. If this is the case, it does
not need to be ascribed to the net as well; otherwise the total power would be
overestimated. On the contrary, the input capacitance of a macro does not contribute to its
power consumption when it is simulated in isolation for characterization. In fact, the power
estimates provided by gate-level or circuit-level simulation represent the power drawn
from the supply net, while input capacitors are directly charged by primary-input lines.

Figure 1 schematically shows the parasitic capacitors connected to a net. All capacitors
represented within the boundary of the driving macro contribute to its output
capacitance C_{out}, all capacitors represented within a driven macro contribute to its
input capacitance C_{in}^{j}, while the sum of all external capacitors is C_{wire}. We
denote by C_{in} the sum of the input capacitances of all driven modules. According to the
above observations, we neglect C_{out} and we compute the actual capacitance to be
associated with a net as the sum of C_{wire} and C_{in}:

C = C_{wire} + C_{in} = C_{wire} + \sum_{j=1}^{FO} C_{in}^{j}    (2)

Fig. 1. Hierarchical Topology of a Generic Wire Connecting RTL Modules.

In the following we discuss the estimation of C_{wire} and C_{in} within the
high-level-synthesis flow described in the introduction.



2.1 Wiring Topology 

The generic wire shown in Figure 1 has a hierarchical structure with three fan-out
points: the first fan-out point, internal to the driving macro, distributes the output
signal to some internal gates that take it as an input to realize more complex output
functions; the second fan-out point, at the RTL, distributes the output signal to all
driven macros; the third fan-out point, internal to each driven macro, distributes the
primary input to several internal gates.

The actual structure of a wire, however, is unknown before placement and routing 
and is not necessarily hierarchical. In many cases, placement and routing tools take 
a flattened gate-level netlist and break the RTL structure to find optimal solutions. 
Nevertheless, in the following treatment we assume the hierarchical structure depicted 
in Figure 1 to provide an early estimate of wiring capacitance at the RTL. This as- 
sumption, though arbitrary, makes the estimation of wiring power consistent with the 
estimation of the pre-characterized power models used for functional units, and enables 
consistent comparisons between alternative design solutions. 



2.2 Wiring Model 

The topology shown in Figure 1 can be viewed as the hierarchical composition of fan-out 
points. A wire with a single fan-out (or, equivalently, with a flattened fan-out topology) 
can be viewed as a basic block for building any hierarchical structure. Without loss of 
generality, in this section we focus on modeling the capacitance of a wire with a single 
fan-out point. With respect to a fan-out point, we call stem the incoming segment, and
branch each outgoing edge. The wiring capacitance associated with a wire is the sum
of its stem and branch capacitances:

$$C_{wire} = C_s + \sum_{i=1}^{N_{FO}} C_{b,i} = C_s + N_{FO} C_b$$    (3)

where C_b denotes the average branch capacitance, C_s the stem capacitance and N_FO
the number of fan-out branches. While N_FO is available at the RTL, C_s and C_b are
not. From a practical point of view, C_s and C_b are the coefficients of a high-level linear
model for C_wire. The values of C_s and C_b depend both on the technology and on the
area of the circuit: the wiring capacitance per unit length is a technology parameter,
while the average length of a stem/branch segment depends on the total area. In our
tool flow, the values of C_s and C_b can be read from the Synopsys technology file used
for mapping. Each wiring model includes the unit capacitance and the average lengths
of a stem and branch segment. In addition, a look-up table is provided to associate
each model with a range of area values. In summary, the capacitance of a wire with
N_FO fan-out branches is estimated as:

$$C_{wire} = C_s(tech, area) + C_b(tech, area)\, N_{FO}$$    (4)
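
As an illustration of Equations (2) and (4), the following minimal sketch selects a wiring
model from an area-indexed look-up table and evaluates the capacitance of a net. The table
layout, the names (WIRE_MODELS, select_wire_model, and so on) and all numeric values are
hypothetical placeholders chosen for the example; they are not taken from any real
technology library. The sketch also assumes that the number of fan-out branches equals
the number of driven inputs.

```python
# Hypothetical wiring models: (upper area bound, stem cap C_s, branch cap C_b).
# The numbers are illustrative only and do not come from a real library.
WIRE_MODELS = [
    (1_000.0, 0.010, 0.004),
    (10_000.0, 0.020, 0.008),
    (float("inf"), 0.040, 0.015),
]


def select_wire_model(area):
    """Return (C_s, C_b) of the first model whose area range contains `area`."""
    for upper_bound, c_s, c_b in WIRE_MODELS:
        if area <= upper_bound:
            return c_s, c_b
    raise ValueError("area is outside all model ranges")


def wire_capacitance(area, n_fo):
    """Equation (4): C_wire = C_s(tech, area) + C_b(tech, area) * N_FO."""
    c_s, c_b = select_wire_model(area)
    return c_s + c_b * n_fo


def net_capacitance(area, input_caps):
    """Equation (2): C = C_wire + sum of driven input capacitances (C_out neglected).

    Assumption: one fan-out branch per driven input.
    """
    return wire_capacitance(area, len(input_caps)) + sum(input_caps)
```

Because the model is chosen by area range, two designs whose total areas fall in the same
range obtain the same C_s and C_b, which is the step-function behavior exploited in the
next subsection.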

2.3 Estimating RTL Wiring Capacitance

Equation (4) can be directly applied to estimate RTL wiring capacitance C_wire. The
wire model (including the values of C_s and C_b tabulated as functions of technology and
area) is taken from the Synopsys technology library specified by the user. The number
of fan-out branches N_FO is directly obtained from the RTL netlist. The total area is
computed as the sum of the back-annotated area estimates for all RTL components.

The only point that needs to be further discussed is the estimation of the area 
associated with each RTL component. In our tool-flow, fast synthesis is automatically 
performed whenever a new power model needs to be constructed and characterized for 
a functional macro. Characterization is based on the results of the power simulation 
of the gate-level implementation of the RTL macro. In principle, the same paradigm 
could be used for area estimates: whenever a new macro is instantiated, fast synthesis 
can be performed to characterize (and back-annotate) its area. On the other hand, 
if the area has already been characterized, the back-annotated value is used directly 
without repeating synthesis and characterization. 

The problem with this process is efficiency. Suppose the designer is using a new
library (without pre-characterized power/area models) and wants to estimate
only wiring power to evaluate the impact of sharing. According to the above approach,
all macros instantiated within the design would have to be synthesized for the sole
purpose of evaluating area. In many cases, this process is very expensive in terms of
CPU time and tool licenses. On the other hand, the dependence of wiring power on the
total area is a step function (the same wiring model is associated with a range of area
values), so its accurate evaluation does not require accurate area estimates.

According to the above observations, we developed a hybrid approach for area 
estimation that realizes a better trade-off between accuracy and performance: 

1. Fast synthesis of a macro is performed only if a power model for the macro needs 
to be characterized; 

2. Whenever a new macro is synthesized the area of its gate-level implementation is 
annotated; 

3. When computing total area, gate-level area estimates are used only if already 
available; 

4. For macros whose area has not been pre-characterized at the gate level, a high-level
area estimator is used, based only on RTL information.

The high-level area estimator we use is derived from Rent's rule. The area A of a macro
is expressed as a power function of its pin count N_IO:

$$A = c\, N_{IO}^{r}$$    (5)

Coefficient c and exponent r are the parameters of the model that need to be char- 
acterized. Characterization is based on the knowledge of the actual area of all macros 
that have been synthesized and mapped onto the current library. In the four-step pro- 
cess outlined above, the general estimator is refined (re-characterized) whenever a new 
macro is synthesized (step 2) in order to take advantage of the new area information. 
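
The characterization of Equation (5) and its use in the four-step process can be sketched
as follows. The text does not say how c and r are fitted; the sketch assumes an ordinary
least-squares fit in the log-log domain, which is one common way to fit a power law. All
function and variable names are hypothetical.

```python
import math


def fit_power_law(pin_counts, areas):
    """Fit A = c * N_IO**r by least squares on log(A) = log(c) + r*log(N_IO).

    Assumption: log-domain fitting; the source does not specify the method.
    Returns (c, r).
    """
    xs = [math.log(n) for n in pin_counts]
    ys = [math.log(a) for a in areas]
    count = len(xs)
    if count < 2:
        raise ValueError("need at least two characterized macros")
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("need at least two distinct pin counts")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sxx
    c = math.exp(mean_y - r * mean_x)
    return c, r


def estimate_area(n_io, c, r):
    """Equation (5): high-level area estimate from the pin count."""
    return c * n_io ** r


def total_area(pin_counts_by_macro, annotated_area, c, r):
    """Steps 3 and 4: use the gate-level area when available, else the estimator."""
    total = 0.0
    for name, n_io in pin_counts_by_macro.items():
        if name in annotated_area:
            total += annotated_area[name]
        else:
            total += estimate_area(n_io, c, r)
    return total
```

Re-running fit_power_law each time a newly synthesized macro adds a data point corresponds
to the refinement described for step 2.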

2.4 Estimating Input Capacitance 

The input capacitance C_in to be associated with a wire is the sum of the input
capacitances of all macros fed by the wire. We assume that the input capacitances viewed
at the inputs of a macro are computed at the gate level during characterization and
stored in an array to be used at the RTL. If back-annotated input capacitances are
not available (because the macro has never been synthesized), a high-level estimator is
used, similar to the area estimator introduced in the previous subsection. The average
input capacitance C_in_avg of a macro is assumed to be related to its area and to its pin
count. Since the area is, in turn, related to the number of pins, we use the model:

$$C_{in\_avg} = d\, N_{IO}^{s}$$    (6)

where parameters d and s need to be characterized in order to fit available data. 
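
A minimal sketch of how C_in could be accumulated for one net, falling back to Equation (6)
when no back-annotated data exist. The data layout (a tuple per driven pin) and the
function names are assumptions made for the example.

```python
def pin_input_capacitance(pin_index, n_io, annotated_caps, d, s):
    """Input capacitance of one driven pin.

    annotated_caps is the per-pin array stored at characterization time,
    or None if the macro has never been synthesized (hypothetical layout).
    """
    if annotated_caps is not None:
        return annotated_caps[pin_index]
    # Equation (6): average input capacitance from the pin count.
    return d * n_io ** s


def net_input_capacitance(driven_pins, d, s):
    """Sum over all pins fed by the net; each entry is (pin_index, n_io, caps)."""
    return sum(
        pin_input_capacitance(index, n_io, caps, d, s)
        for index, n_io, caps in driven_pins
    )
```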



3 Multiplexers 

Multiplexers have two peculiarities that make them different from most other macros:
first, they have a regular structure; second, they are bit-sliced elements (i.e., an
n-bit macro can be viewed as an array of n 1-bit macros, independently processing
individual bits). In this section, we exploit the first property to build ad-hoc models
that improve upon the accuracy of the general-purpose power models developed for
functional macros, and the second property to make the models scalable with the
bit-width. Model scaling significantly reduces the characterization effort, allowing us to
characterize (i.e., synthesize and simulate at the gate level) only 1-bit macros, while
using the models for arbitrary bit-widths.

Multiplexers are usually specified as soft macros that can be specialized by the
designer by setting not only the bit-width (W) of the data inputs, but also the number
of input ports, the number of control inputs and the encoding used for input selection.
In principle, the concept of model scaling could be applied to scale the power model of
a soft macro with respect to all its generics. Generalized model scaling trades off some
accuracy to save characterization time. In this context, however, we are not
investigating general accuracy-efficiency tradeoffs; rather, we are interested in exploiting
the bit-sliced structure of MUXes to scale their models with negligible (if any) accuracy
loss. Hence, bit-width W is the only parameter we consider for scaling. From a practical
point of view, instances of the same soft macro that differ only in the value of W will
share the same (scaled) power model, while instances that differ (also) in some other
generics will be treated as different macros with different power models. The model we
will derive has the form:

$$Power = S(W)\, P(stats)$$    (7)

where W is the bit-width, stats represents generic boundary signal statistics, S(W) is
a scaling function and P(stats) is the power model for the 1-bit instance of the macro.
The modeling task is thus partitioned into two sub-tasks: first, deriving a power model
P(stats) for a 1-bit MUX; second, determining S(W) for scaling the model.

3.1 Preliminary Analysis 

We performed preliminary experiments to verify the disjoint dependence of power on
I/O statistics and bit-width. We used as benchmark the universal multiplexer taken
from the Synopsys DesignWare library. A larger set of benchmarks was obtained by
generating different instances of the MUX by specifying different generics. Each
benchmark (i.e., each macro with assigned generics) was then synthesized for different
bit-widths, mapped onto a library characterized for power and simulated for different
input statistics using Synopsys VSS with DesignPower.

Fig. 2. Energy as a Function of Bit-Width, for a 2-Port and a 4-Port Multiplexer.

From the first set of experiments we observed that the signal probability (i.e., the
probability that an I/O signal takes value 1) has a negligible impact on the power
consumption of all benchmarks. Based on this observation, we used only the transition
probability (i.e., the probability that an I/O signal has a transition) to represent
input/output statistics. From a practical point of view, we performed power simulations
with input transition probabilities (denoted by D_in) ranging from 0.01 to 0.99, while
keeping the average signal probability at 0.5.

Figure 2 plots on the left the power consumption of a 2-port MUX as a function
of W. Piece-wise linear curves have been obtained by connecting points corresponding
to the same input statistics. Hence, curves are parameterized with respect to the input
statistics, namely D_in.

From the plots we notice that: i) The relative distance between the curves is almost
independent of the bit-width; ii) The relation between bit-width and energy is
almost linear. Observation i) suggests that the dependence on the bit-width can be
de-coupled from the dependence on input statistics, thus motivating the development
of power models of the form of Equation 7. Observation ii) shows the beneficial effect
of the bit-sliced nature of the macros: The power consumption of an n-bit macro is
approximately n times the power consumption of the 1-bit macro evaluated for the same
input statistics. A similar behavior has been observed for general datapath macros [8].

Unfortunately, there are exceptions to this linear dependence, as shown in the
right-hand diagram of Figure 2, which plots the same curves for a 4-port MUX. The reason
for this non-linearity, which violates the bit-slice composition rule, can be found by
looking at the gate-level implementation of the macros. Though it is always possible to
build an n-bit MUX from n 1-bit components, width-dependent design choices can be taken
by the synthesis tool to optimize the implementation, resulting in different netlists. If
this is the case, we cannot rely on linearity and we need a refined scaling criterion.

3.2 Scaling 

To further verify the disjoint dependence between input statistics and bit-width, we 
analyzed the behavior of: 

- R_1(W_1, W_2, D_in) = P(W_1, D_in) / P(W_2, D_in), representing the ratio between the
energy consumption of two macros with different bit-widths, for the same input
statistics;

- R_2(W, D_in1, D_in2) = P(W, D_in1) / P(W, D_in2), representing the ratio between the
energy consumption of the same macro for different input statistics.
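
The two ratios can be computed from a table of simulated power values as in the sketch
below. The dictionary layout keyed by (W, D_in) and the function names are assumptions for
the example; the spread measure is the population standard deviation divided by the mean,
which may differ from the exact statistic used by the authors.

```python
from statistics import mean, pstdev


def ratio_r1(power, w1, w2, d_in):
    """R_1(W_1, W_2, D_in) = P(W_1, D_in) / P(W_2, D_in)."""
    return power[(w1, d_in)] / power[(w2, d_in)]


def ratio_r2(power, w, d_in1, d_in2):
    """R_2(W, D_in1, D_in2) = P(W, D_in1) / P(W, D_in2)."""
    return power[(w, d_in1)] / power[(w, d_in2)]


def relative_spread(values):
    """Population standard deviation of the values, relative to their mean."""
    return pstdev(values) / mean(values)
```

For a perfectly separable model P(W, D_in) = S(W) P(D_in), R_1 does not change with D_in
and R_2 does not change with W, so both spreads are zero; a small measured spread is what
justifies the form of Equation 7.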

Figure 3 shows, on the left, the behavior of R_1 as a function of D_in (curves are
parameterized on the values of W_1 and W_2) and, on the right, the behavior of ratio
R_2 as a function of W (curves are parameterized on the values of D_in1 and D_in2).
It is apparent that both R_1 and R_2 are almost independent of D_in (their standard
deviation being 0.029% and 0.0053%, respectively). The scaling factor that has to be
applied to a power model characterized for a reference macro with bit-width W_ref in
order to estimate the power consumption of a different instance of the same macro
with bit-width W is nothing but ratio R_1 computed for W_1 = W and W_2 = W_ref,
that is, S(W) = R_1(W, W_ref). We refer to this scaling as analytical, since it does not
require the synthesis of the macro to be scaled.

Under the ideal bit-slice composition assumption, the value of R_1 can be directly
obtained at no cost as the ratio between the two bit-widths: S(W) = W/W_ref. Though
in principle we could use W_ref = 1 for characterization, we achieved better accuracy
by using W_ref = 8. In general, the larger the value of W_ref, the lower the scaling factor
that multiplies the inherent characterization noise.
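
The effect of W_ref on noise can be made explicit with a short derivation. This is a sketch
under assumptions that the text does not state: the symbol \varepsilon is introduced here
for the absolute error in the characterized reference power, and ideal bit-slice
composition is assumed, so that P(W) = (W/W_ref) P(W_ref).

```latex
\begin{align}
\hat{P}(W_{ref}) &= P(W_{ref}) + \varepsilon \\
\hat{P}(W) &= \frac{W}{W_{ref}}\,\hat{P}(W_{ref})
            = P(W) + \frac{W}{W_{ref}}\,\varepsilon
\end{align}
```

For W = 32, the error term is 32\varepsilon with W_ref = 1 but only 4\varepsilon with
W_ref = 8, provided \varepsilon has a comparable magnitude for the two reference widths.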

If the bit-slice assumption doesn't hold (as shown in Figure 3), using the
analytical scaling factor W/W_ref may lead to unacceptable errors. In this case, we resort
to fast synthesis and simulation (for a fixed value of D_in) of the scaled macro in
order to obtain the term P(W, D_in) used to compute S(W) as R_1(W, W_ref, D_in) =
P(W, D_in) / P(W_ref, D_in). We refer to this scaling as synthesis-based.
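
The two scaling criteria, and their use in Equation 7, can be summarized by the following
sketch. Function names are hypothetical; the power values passed to the synthesis-based
variant stand for gate-level simulation results obtained at the same fixed D_in.

```python
def analytical_scaling(w, w_ref):
    """S(W) = W / W_ref, valid under ideal bit-slice composition."""
    return w / w_ref


def synthesis_based_scaling(p_scaled, p_ref):
    """S(W) = P(W, D_in) / P(W_ref, D_in), both simulated at the same D_in."""
    return p_scaled / p_ref


def scaled_power(scale, reference_model_power):
    """Equation 7: Power = S(W) * P(stats), with P evaluated on the reference model."""
    return scale * reference_model_power
```

The synthesis-based variant costs one extra synthesis and one simulation run per
bit-width, but it needs only a single value of D_in because R_1 is almost insensitive
to D_in.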



[Figure 3 plot legend: W1=8, W2=4; W1=32, W2=8; W1=4, W2=16.]

Fig. 3. Plots of R_1 and R_2 for a 2-port Multiplexer.



3.3 Power Model 

The model we use to represent the dependence of the power consumption of a
multiplexer on boundary statistics is based on the following observations: i) Signal
probabilities have a negligible impact (and can be neglected); ii) The activity at each
I/O port is positively correlated with power consumption (and should appear in the
model); iii) All data inputs have similar fanout cones (and similar effect on power
consumption); iv) All data outputs have similar fanin cones (and similar correlation with
internal power). These observations lead to the following power model for P(stats):

$$P(stats) = C_{in} D_{in} + C_s D_s + C_{out} D_{out}$$    (8)

where Din is the average transition probability of data inputs, Ds is the average tran- 
sition probability of the selection signals, Dout is the average output activity, and Cin, 
Cs and Cout are fitting coefficients to be determined by regression analysis. 
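
A minimal, dependency-free sketch of the regression step for Equation (8): it builds the
3x3 normal equations of a least-squares fit without intercept and solves them by
Gauss-Jordan elimination. The sample layout and function names are assumptions for the
example, and the fitting procedure used by the authors' tools may differ in detail.

```python
def fit_mux_model(samples):
    """Least-squares fit of P = C_in*D_in + C_s*D_s + C_out*D_out.

    samples: list of (d_in, d_s, d_out, power) tuples from gate-level simulation.
    Returns (C_in, C_s, C_out).
    """
    # Augmented normal equations: (X^T X | X^T y).
    aug = [[0.0] * 4 for _ in range(3)]
    for *activities, power in samples:
        for i in range(3):
            for j in range(3):
                aug[i][j] += activities[i] * activities[j]
            aug[i][3] += activities[i] * power
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(aug[row][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("activity data do not determine the coefficients")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(3):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                for k in range(col, 4):
                    aug[row][k] -= factor * aug[col][k]
    return tuple(aug[i][3] / aug[i][i] for i in range(3))


def mux_power(coefficients, d_in, d_s, d_out):
    """Evaluate Equation (8) for given boundary activities."""
    c_in, c_s, c_out = coefficients
    return c_in * d_in + c_s * d_s + c_out * d_out
```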

4 Experimental Results

4.1 Wiring 

Our approach allows the evaluation at the RTL of gate-level wire models. Hence, when 
all RTL components have been pre-characterized for area and input capacitance, the 
accuracy provided by our model is, by construction, the target gate-level accuracy. 
What needs to be tested is the approximation introduced by the lack of gate-level 
information about the area and input load of (some of) the design components. 

We present three sets of results that assess the accuracy of the area estimates 
provided by Equation 5, the input capacitance estimates provided by Equation 6, and 
the overall wiring power estimates provided by our approach. 

For our experiments we considered all the DesignWare macros, with multiple
instances of each soft macro. A first area model is obtained by individually characterizing
each type of macro. Another model is obtained according to the value of the exponent
r used in the generic model; in particular, we built three clusters of macros with similar
A-N_IO relations, and we associated a unique area model with each cluster. Finally, we
characterized a general, unified area model for all macros. We use the terms specific,
clustered and general to denote the three types of area models above. The accuracy
provided by these models is reported on the left-hand side of Table 1, expressed in terms
of average relative error and standard deviation. As expected, the more general the
model, the less accurate the estimates it provides. In fact, the relation between pin
count and area strongly depends on the type of macro. For instance, multipliers have
a quadratic relation (r = 2) while adders have a linear relation (r = 1). Trying to use
the same model for all macros impairs accuracy. The same experiment was performed
for the estimation of the input capacitances. Results are similar to those obtained for
area estimates, and they are summarized on the right-hand side of Table 1.



Table 1. Area and Input Wiring Capacitance Estimate Results.

             |           Area            | Input Wiring Capacitance
Model        | Avg. Error  Err. St. Dev. | Avg. Error  Err. St. Dev.
-------------+---------------------------+---------------------------
specific     |     1.37         1.18     |     1.16         1.23
clustered    |    23.51        97.37     |     7.46        44.93
general      |    66.26        77.03     |    31.53        31.44



Finally, we tested the entire approach on the case study of Example 1. When area 
and input capacitance have been pre-characterized at the gate-level for all the adders 
and the MUXes instantiated within the elliptic filter, our approach provides the same 
estimates for wiring power reported in Example 1. In other words, the RTL accuracy
is the same as the gate-level one. 

If no pre-characterization is performed and the general models are used to estimate 
area and input capacitance, the average error is around 31%. If specific area and ca- 
pacitance models are used for adders and MUXes, the average error on wiring power 
estimates reduces to 6%. 

4.2 Multiplexers

We tested our power model on different MUXes obtained by specifying different port
numbers and encoding styles for the universal multiplexer taken from the Synopsys
DesignWare library. Each MUX was characterized using a reference bit-width of 8
(W_ref = 8). For characterization, the 8-bit instance of the MUX was first synthesized
and mapped on a gate-level library characterized for power. Gate-level power simulation
was then repeatedly performed by DesignPower for different input statistics to obtain
data for linear regression. Least-squares fitting was finally performed to fix the three
coefficients of the linear equation.

The accuracy was evaluated through concurrent RTL and gate-level simulation for 
25 input streams. The average error obtained for the reference bit-width (i.e., without 
scaling) was below 10% for all benchmarks, with a standard deviation around 5%. 

The accuracy loss caused by scaling is reported in Figure 4 for 2-port and 4-port 
MUXes. Two series of results are reported on each graph, that refer to the analytical 
and synthesis-based scaling. As expected, synthesis-based scaling improves upon the 
accuracy of analytical scaling. The advantage is negligible for 2-port MUXes (because 
of the good linear relation between power and bit-width), while it is remarkable for 
4-port MUXes with bit-width of size 1 and 2, i.e., when the power consumption does 
not scale linearly with the bit-width. 

Fig. 4. Experimental Results on a 2-Port and 4-Port Multiplexer.



5 Conclusions 



We have presented RTL power models for the steering logic (multiplexers and wiring) 
that are usually not accounted for during RTL power estimation because they are not 
explicitly instantiated into the RTL description, in spite of their potentially high impact 
on the total power budget. The power model for MUXes can be used for special complex 
types with re-configurable numbers of data bits and number of ways. The interconnect 
model is obtained by empirically relating capacitance to area, that is either estimated 
by means of statistical models or extracted from back-annotation information available 
at the gate level. Experimental results have demonstrated the good accuracy of the 
models, yielding estimation errors with respect to gate-level estimates below 10%. 


Abstract. In this paper, a power management technique based on dynamic
frequency scaling is proposed. The proposed technique targets
digital receivers employing adaptive sampling. Such circuits over-sample
the analogue input signal in order to achieve timing synchronization.
The proposed technique introduces power savings by forcing the receiver
to operate only on the "correct" data for the time intervals during
which synchronization is achieved. The simple architectural modifications
needed for the application of the proposed strategy are described.
A number of FIR filters, which are the basic components of almost every
digital receiver, are used as test vehicles. The experimental results show
that the application of the proposed technique introduces significant
power savings, while negligibly increasing area and critical path.



1 Introduction 

Nowadays, sophisticated handsets with wireless communication capabilities
have invaded the world market. In such applications low power consumption
is of great importance, both to extend battery life and to reduce the
packaging- and cooling-related cost.

One of the most efficient low-power techniques, applicable at all levels of
abstraction of the design flow, is dynamic power management [1,2]. The most
common approach to dynamic power management is to selectively shut down a
resource when it performs useless operations. Techniques based on this
concept have been proposed in [2, 3, 4, 5, 6].

In this paper the power management concept is applied to a class of digital
receivers, namely digital receivers employing adaptive sampling. Adaptive
sampling is very common in a variety of receiving applications, especially
in wireless telecommunication terminals. An analysis of the behavior of
such receivers indicates that the percentage of data contained in the incoming
stream that must be processed for the correct operation of the receiver
depends on whether or not synchronization is achieved. A novel technique based
on dynamic frequency scaling, which introduces power savings by forcing the
receiver to operate only on the "correct" data for the time intervals during which
synchronization is achieved, is presented here. The application of the proposed
technique is rather simple, since it requires minor modifications with respect to
conventional architectures, while it introduces significant power savings.

The rest of this paper is organized as follows: Section 2 is dedicated to basic 
background. In section 3 the target architecture models are described. In sec- 
tion 4 the proposed power management technique is presented. In section 5 the 
proposed technique is applied in demonstrator applications and its effect on the 
various design parameters is analyzed. Finally, in section 6 some conclusions are 
offered. 

2 Basic Background 

In a typical digital telecommunication system, the transmitter modulates the
binary data into phase, and/or frequency, and/or amplitude differences of an
analogue signal (carrier) [10]. Symbol frequency (fs) is the number of symbols
transmitted per second. The receiver knows the symbol frequency. However,
the exact instance at which the modulated input signal must be
sampled is not known. Furthermore, in the general case the receiver knows the
width of the data bursts, but not their starting position. Symbol
timing synchronization is the process of deriving at the receiver timing signals
indicating where in time the transmitted signals are located [9,10]. If symbol
timing synchronization is not achieved, then even a small shift in time of the
sampling instances can result in erroneous received data. Frame synchronization
is the process of locating at the receiver the position of a synchronization pattern
(marker), periodically inserted in the data stream by the transmitter [9,10].

Adaptive sampling through oversampling is a very common synchronization
method in mobile telecommunication systems [e.g., 7, 8], since it eliminates
the need for a power-hungry analog VCO (Voltage-Controlled Oscillator)
and provides both symbol timing and frame synchronization in one step [9].
According to adaptive sampling, the sampling instance during the symbol period is
selected separately for each data burst. Oversampling is a mechanism employed
to choose the correct sampling instance during the symbol period for each data
burst. Specifically, instead of sampling the analogue input signal once, the input
signal is sampled N (oversampling ratio) times during each symbol period. This
way, instead of one input data stream, we have N input data streams: one input
data stream per sampling instance during the symbol period. A block responsible
for synchronization decides which of the N input data streams corresponds to
the correct sampling instance. The input stream that corresponds to the correct
sampling instance is the one that includes the synchronization pattern. In this
way symbol timing and frame synchronization are jointly performed.
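
The stream selection described above can be sketched as follows. This is a deliberately
simplified model: it assumes that the samples have already been reduced to hard symbol
decisions and that the synchronization pattern is matched exactly, whereas a practical
receiver would more likely use correlation. All names are hypothetical.

```python
def split_streams(samples, n):
    """Split the oversampled sequence into N streams, one per sampling phase.

    Stream k holds the k-th sample of every symbol period.
    """
    return [samples[k::n] for k in range(n)]


def find_sync_stream(samples, n, pattern):
    """Return (stream index, position) of the first stream containing `pattern`.

    The stream index identifies the correct sampling instance (symbol timing) and
    the position locates the marker inside the stream (frame synchronization).
    Returns None when no stream contains the pattern.
    """
    width = len(pattern)
    for k, stream in enumerate(split_streams(samples, n)):
        for start in range(len(stream) - width + 1):
            if stream[start:start + width] == pattern:
                return k, start
    return None
```

A single search thus yields both results at once, which is the sense in which symbol
timing and frame synchronization are performed jointly.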

3 Target Architecture Model 

Adaptive sampling through oversampling imposes one of the following receiver
architecture styles:




Reducing Power Consumption through Dynamic Frequency Scaling 






1. The receiver is designed as it would be if it processed only one input
stream, and the data registers are then replaced with N-position shift registers.
In this way the data path must operate at N times the symbol frequency in order
to produce output at symbol frequency.

2. N-parallel identical data paths operating at the symbol frequency are imple- 
mented, one for each input stream. 

3. P-parallel identical data-paths are implemented each one of them processing 
N/P input streams and operating at N/P times the symbol frequency. 
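
The three styles differ only in the number of parallel data paths and in the clock
frequency of each one, which the sketch below makes explicit. Treating the first and
second styles as the special cases P = 1 and P = N of the third is a framing chosen for
this example, and the requirement that P divides N is an added assumption.

```python
def architecture_clocking(n, symbol_freq, parallel_paths=1):
    """Return (number of data paths, clock frequency of each data path).

    parallel_paths = 1      : first style, one path at N times the symbol frequency
    parallel_paths = n      : second style, N paths at the symbol frequency
    1 < parallel_paths < n  : third style, P paths at N/P times the symbol frequency
    """
    if parallel_paths < 1 or n % parallel_paths != 0:
        raise ValueError("parallel_paths must be a positive divisor of n")
    return parallel_paths, (n // parallel_paths) * symbol_freq
```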



Area constraints usually prohibit the use of the second architecture style. So, 
in order to be realistic only the first and third design styles are studied here. 

The following analysis assumes that two signals are generated within the
receiver. The first is the signal TS, indicating that timing synchronization is
achieved. This signal is usually an interface signal to the system of which the
receiver is a component. The second signal is correct_sample, which denotes
which of the N input streams is the one that corresponds to the correct sampling
instance. This signal is present in receivers incorporating adaptive sampling
through oversampling, since it is needed in order to select the input stream that
will be fed to the output.

An abstract model of the first architecture style is illustrated in fig. 1. The
functional units (FUs) implement the digital filtering and the demodulation
algorithm. In order to reuse the resources for all input streams and perform the
operations between samples belonging to the same input stream, N-stage shift
registers are used to store the data at the inputs and output of each resource.
The shifters are clocked with N times the symbol frequency in order to produce
output with symbol frequency. The output of the receiver can be in high
impedance when TS indicates that there is no timing synchronization. When timing
synchronization is established, the correct_sample signal selects the stream that
must be fed to the output, and the output register is clocked with symbol
frequency. In the case of pipelined FUs, the data of the same stream are not stored
in the same stage at the input and output shift registers of the FU. Specifically,
for the stream J, if the data are positioned in the K-th stage of the output shift
register of an L-stage pipelined FU, then the data that correspond to the same
data stream are mapped at the input shift register of the same FU according
to the equation: (K + L) mod N. Thus, in order to perform operations between
samples of the same stream, special care must be taken. For example, in fig. 1,
let us call the output of FU1 f(X_n) and assume that FU2 is a multiplier with
pipeline depth 2, which performs the operation f(X_n) x X_{n-1}. In any case,
f(X_n) is fed to FU2 by the N-th stage of the shift register SR-2. In order for
FU2 to perform the multiplication between samples of the same stream, the
sample X_{n-1} must be fed to FU2 by the (N + 2) mod N = 2nd stage of the shift
register SR-1. The effect of the pipelined FUs is ignored in the figures for clarity,
and it is not discussed further in this paper, due to space limitations and since it
is almost straightforward.
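
The stage mapping for pipelined FUs can be written as a one-line function. The sketch
assumes 1-based stage numbering and maps a result of 0 to stage N; this convention is an
assumption of the example, since the text gives only the expression (K + L) mod N.

```python
def input_stage_for(output_stage, pipeline_depth, n):
    """Stage of the input shift register holding data of the same stream.

    Implements (K + L) mod N with stages numbered 1..N, so a result of 0 is
    reported as stage N (assumed convention).
    """
    stage = (output_stage + pipeline_depth) % n
    return stage if stage != 0 else n
```

With K = N and L = 2 the function returns 2, matching the multiplier example in which
X_{n-1} is taken from the 2nd stage of SR-1.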

Fig. 1. Architecture Model (1).



An abstract view of the third architecture style is given in fig. 2. Let us assume
that P parallel data paths are used and that the oversampling ratio is N. According
to the third architecture style, the input stream (containing N different sample
streams) is demultiplexed in time and each of the P derived streams (containing
N/P different sample streams) is fed to the inputs of one of the P parallel
data paths. Alternatively, P parallel ADCs can be used. Each of the parallel data
paths operates at N/P times the symbol frequency and is implemented based on
the principles of the first architecture style. Again, the output of the receiver can
be in high impedance when TS indicates that there is no timing synchronization.
During the interval in which the receiver is synchronized, the correct_sample
signal selects the stream that must be fed to the output. Here too, the output
register is clocked with symbol frequency. The latter architecture model actually
consists of P parallel data paths compatible with the first architecture model (each
one operating at N/P times the symbol frequency). For this reason, the rest of
this paper focuses on the first architecture model. The proposed technique is
applied in an analogous way to the third architecture model as well.

4 Dynamic Frequency Scaling 

Adaptive sampling through oversampling requires that the whole receiving
algorithm be computed on N input data streams, each of which corresponds to
a different sampling instance during the symbol period, while only one stream
corresponds to the correct sampling instance. This introduces a significant power
overhead. The power overhead cannot be avoided for the time intervals during
which synchronization is not achieved. However, after the detection of the
synchronization pattern and up to the end of the frame, the power overhead can be
removed by operating only on the data that correspond to the correct sampling
instance.

Fig. 2. Architecture Model (3).

For this purpose a frequency scaling technique is proposed in this paper. 
Specifically, the operation frequency can be reduced to symbol frequency after 
the synchronization pattern detection and up to the end of the frame. During 
this time interval, the receiver is forced to operate only on the input stream that 
corresponds to the correct sampling instance. After the end of each frame the 
receiver operates again at oversampling frequency. 
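
The clocking policy just described, and a first-order view of its benefit, can be sketched
as follows. The average-frequency figure is only a rough proxy for power: it relies on the
standard first-order relation that dynamic switching power is proportional to clock
frequency, and it ignores the overhead of the added logic and any change in switching
activity. The names and the sync_fraction parameter are assumptions of this example.

```python
def clock_frequency(synchronized, n, symbol_freq):
    """Operating frequency: symbol frequency when synchronized, else N times it."""
    return symbol_freq if synchronized else n * symbol_freq


def average_clock_frequency(sync_fraction, n, symbol_freq):
    """Time-averaged clock frequency.

    sync_fraction is the fraction of time between pattern detection and the end
    of the frame, i.e. the time during which the receiver runs at symbol frequency.
    """
    if not 0.0 <= sync_fraction <= 1.0:
        raise ValueError("sync_fraction must lie in [0, 1]")
    return (sync_fraction * symbol_freq
            + (1.0 - sync_fraction) * n * symbol_freq)
```

For example, with N = 4 and a receiver that is synchronized 75% of the time, the average
clock frequency is 1.75 times the symbol frequency instead of 4 times.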

From the architecture point of view, some modifications of the original
architecture model are needed in order to preserve the correct functionality of the
receiver. In fig. 3, the proposed first architecture model is illustrated. The main
differences between the original (fig. 1) and proposed architecture model (fig. 3)
are the following:

1. The shift registers of the original architecture model are replaced, in the
proposed architecture model, with shift registers whose output is either their
N-th or their 1st stage. The signal TS (indicator of synchronization) selects
which of the two stages is fed as output.

2. In the proposed architecture model, the clock_select module is added while
the out_clock module is removed. The clock_select module produces the
select_clock, which is the clock with symbol frequency. When the select_clock
is used to trigger the ADC, the analogue input is sampled only at the
instances indicated by the correct_sample signal. Fig. 3 gives the waveform of
the select_clock when the correct_sample signal indicates that the correct
input stream is stream i.

Fig. 3. Proposed Architecture Model.

For the proposed architecture model, when the signal Ts indicates that 
synchronization is not achieved, the clock with N times the symbol frequency is 
used. Additionally, the N-th stage of the shift-registers is fed to their output. In 
this case, the proposed architecture model operates exactly as the original one. 
However, when Ts indicates that synchronization is achieved, the select-clock 
(symbol frequency clock) is used throughout the receiver. The select-clock 
waveform is such that it forces the ADC to sample only at the correct instance during 
the modulation period. Since the data path should manipulate only one stream, 
there is no need to shift the data on the inputs and outputs of the functional 
units. Thus, the 1st stage of the shift registers is used as a single register in order 
to feed and store the inputs and outputs of the functional units. The remaining N-1 
stages of the shift-registers are bypassed and for this reason their clock can be 
disabled. The same holds for the receiver’s output shift-register. 
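
The two operating modes described above can be mimicked by a small behavioural model. This is a hedged sketch rather than the actual hardware description: the function names and the list-based representation of clock ticks and register stages are assumptions made for illustration, and the full-length (N-th stage) output is assumed for the unsynchronized mode.

```python
def adc_sample_ticks(n, correct_sample, num_symbols, synchronized):
    """Oversampling-clock tick indices at which the ADC is triggered.

    n:              oversampling ratio N
    correct_sample: index (0..N-1) of the correct sampling instance
    num_symbols:    number of symbol periods to cover
    synchronized:   state of the synchronization indicator Ts
    """
    if not 0 <= correct_sample < n:
        raise ValueError("correct_sample must lie in range(n)")
    if not synchronized:
        # Oversampling clock: every tick triggers the ADC.
        return list(range(n * num_symbols))
    # Select-clock: one tick per symbol, at the correct sampling instance.
    return [s * n + correct_sample for s in range(num_symbols)]


def register_output(stages, synchronized):
    """Output of an N-stage shift register with a selectable tap.

    stages[0] models the 1st stage and stages[-1] the N-th stage.
    """
    # Synchronized: only the 1st stage is used, the other stages are bypassed.
    return stages[0] if synchronized else stages[-1]


if __name__ == "__main__":
    print(adc_sample_ticks(4, 1, 3, synchronized=True))
    print(register_output([10, 20, 30, 40], synchronized=False))
```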

5 Experimental Results 

As stated earlier in this paper, the proposed frequency scaling technique is 
applicable to receivers employing adaptive sampling through oversampling. One of 
the main parts of such receivers is the digital filtering stage. For this reason, 
Finite Impulse Response (FIR) filters are chosen as the demonstrator application. 
Four- and eight-tap FIR filters are implemented. A low-power implementation 
of the FIR filters is considered, where multiplications are replaced with shift-add 
operations. For this reason no resource reuse is considered. The filter coefficient 
bit-width is 10 bits, while the symbol bit-width is 9 bits. For each FIR filter three 
different oversampling ratios are considered, namely 4, 8 and 16. The frame is 
considered to consist of 128 symbols. Finally, four different values for the width 
of the synchronization pattern are considered, namely 1, 2, 4 and 8 symbols. 

Table 1: Power for Ts/Tp = 1/128 

#taps   N    Power (mW)            Diff. (%) 
             Conv.      Prop. 
4       4     10.405     3.320     68.09 
        8     21.549     6.502     69.83 
       16     47.460    12.937     72.74 
8       4     27.032     8.478     68.64 
        8     54.945    16.630     69.73 
       16    116.328    33.166     71.49 

Table 2: Power for Ts/Tp = 2/128 

#taps   N    Power (mW)            Diff. (%) 
             Conv.      Prop. 
4       4     10.403     3.354     67.76 
        8     21.322     6.539     69.33 
       16     47.240    12.929     72.63 
8       4     26.997     8.569     68.26 
        8     54.468    16.747     69.25 
       16    115.754    33.115     71.39 

Table 3: Power for Ts/Tp = 4/128 

#taps   N    Power (mW)            Diff. (%) 
             Conv.      Prop. 
4       4     10.287     3.381     67.13 
        8     21.101     6.506     69.17 
       16     46.647    12.919     72.30 
8       4     26.735     8.722     67.38 
        8     53.869    16.709     68.98 
       16    114.308    33.019     71.11 

Table 4: Power for Ts/Tp = 8/128 

#taps   N    Power (mW)            Diff. (%) 
             Conv.      Prop. 
4       4     10.043     3.391     66.24 
        8     20.572     6.532     68.25 
       16     45.432    12.882     71.65 
8       4     26.064     8.764     66.38 
        8     52.472    16.739     68.10 
       16    111.204    32.793     70.51 

Table 5: Area measures 

#taps   N    Area (mils^2)             Diff. (%) 
             Conv.        Prop. 
4       4     5344.91      5456.64     2.09 
        8     5957.43      6058.57     1.70 
       16     7181.44      7270.18     1.24 
8       4    10802.74     11011.79     1.94 
        8    11832.08     12022.89     1.61 
       16    13895.81     14602.52     3.19 

Table 6: Critical path measures 

#taps   N           Critical Path (ns)    Diff. (%) 
                    Conv.     Prop. 
4       4, 8, 16    33.60     34.04       1.31 
8       4, 8, 16    38.76     39.20       1.14 

For each different FIR filter configuration, two implementations are 
considered: one according to the first architecture model of section 3 and one according 
to the proposed architecture model. For the FIR implementation, the SYNOPSYS 
and Mentor-Graphics CAD tools were employed and the 0.6 micron AMS 
cell-library was used for mapping. For each different FIR filter the power 
consumption, the area and the critical path were measured. Power measurements 
were acquired with toggle-count during logic-level simulation under a real delay 
model and capacitance estimates provided by the CAD tools, assuming a 5V power 
supply and 4K highly-correlated, 9-bit, valid input vectors. The latter means that 
if the oversampling ratio is N then the number of input vectors used is 4K × N. 
The 4K × (N − 1) vectors that correspond to the wrong sampling instances were 
generated with a random variation from the corresponding valid vector in the 
range of ±10%. For area and critical path measurements the reports provided 
by the CAD tools were used. 

Tables 1 to 4 illustrate the power consumed by both the original and 
proposed implementations for four different cases. Each case corresponds to a 
different ratio Ts/Tp of synchronization pattern width to frame width. The ratio 
Ts/Tp determines the relation between the sizes of the time intervals operating 
at oversampling and symbol frequency, so it was expected that as this ratio 
increases, power savings decrease. Experimental results indicate that the effect 
of the ratio Ts/Tp on power savings is weak. For example, the power savings for 
the case that Ts/Tp = 1/128 are on average 1.57 percentage points greater than 
the power savings for the case that Ts/Tp = 8/128. 

Additionally, from tables 1-4, it can be observed that the amount of power 
saved by the proposed technique mainly depends on the oversampling ratio. 
This is reasonable, since the oversampling ratio determines the difference between 
the power dissipated during the time intervals operating at oversampling and symbol 
frequency. Furthermore, the percentage of power savings does not seem to depend 
on the design size, since the results are very close for both the 4-tap and 8-tap FIRs. 
In any case, the proposed architecture model consumes significantly less power 
(on average 69.43% less) than the original architecture model. 
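
The averages quoted above can be rechecked directly from the tabulated power figures. The short Python script below is a sketch for that purpose: the data are copied from Tables 1 to 4, while the variable and function names are illustrative assumptions. It recomputes the relative saving of every configuration, the average per Ts/Tp case, and the overall average, which should reproduce the reported figures to within rounding.

```python
# Power figures in mW, copied from Tables 1-4.
# Key: synchronization pattern width Ts in symbols (frame width Tp = 128).
# Value: list of (taps, N, conventional, proposed).
POWER_MW = {
    1: [(4, 4, 10.405, 3.320), (4, 8, 21.549, 6.502), (4, 16, 47.460, 12.937),
        (8, 4, 27.032, 8.478), (8, 8, 54.945, 16.630), (8, 16, 116.328, 33.166)],
    2: [(4, 4, 10.403, 3.354), (4, 8, 21.322, 6.539), (4, 16, 47.240, 12.929),
        (8, 4, 26.997, 8.569), (8, 8, 54.468, 16.747), (8, 16, 115.754, 33.115)],
    4: [(4, 4, 10.287, 3.381), (4, 8, 21.101, 6.506), (4, 16, 46.647, 12.919),
        (8, 4, 26.735, 8.722), (8, 8, 53.869, 16.709), (8, 16, 114.308, 33.019)],
    8: [(4, 4, 10.043, 3.391), (4, 8, 20.572, 6.532), (4, 16, 45.432, 12.882),
        (8, 4, 26.064, 8.764), (8, 8, 52.472, 16.739), (8, 16, 111.204, 32.793)],
}


def saving_percent(conventional, proposed):
    """Relative power saving of the proposed over the conventional design."""
    return 100.0 * (conventional - proposed) / conventional


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Average saving per Ts/Tp case and over all 24 configurations.
per_case = {ts: mean(saving_percent(c, p) for _, _, c, p in rows)
            for ts, rows in POWER_MW.items()}
overall = mean(saving_percent(c, p)
               for rows in POWER_MW.values() for _, _, c, p in rows)

if __name__ == "__main__":
    print({ts: round(v, 2) for ts, v in per_case.items()})
    print("overall:", round(overall, 2))
    print("Ts=1 vs Ts=8:", round(per_case[1] - per_case[8], 2))
```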

Tables 5 and 6 give the area and critical path measures, respectively, 
for both the original and proposed implementations. As can be observed, 
the area overhead introduced by the multiplexers, the additional control logic 
and the interconnections of the proposed architecture model is on average 1.96%. 
Furthermore, the proposed architecture increases the critical path of the original 
design by the delay of the multiplexer that is added at the output of the shift 
registers. This increase is less than 0.5ns in any case. The overheads in area and 
critical path introduced by the proposed technique are considered to be negligible 
compared to the corresponding power savings. It must be stressed here, that the 
latency of the design is not affected by the proposed technique. 

Finally, an interesting observation is the following: the power consumption of 
the original case when N = 4 and 8 is greater than the power consumption of the 
proposed case when N = 8 and 16, respectively. This means that the application 
of the proposed technique enables the use of higher oversampling ratios (which 
means higher reception quality), while not exceeding the initial power budget. 



6 Conclusions 

This paper focuses on the strategy of power management through dynamic fre- 
quency scaling, which is based on the adaptation of operation frequency to the 
computational load. A technique for the application of the above strategy on re- 
ceiver applications, employing adaptive sampling through oversampling, is pro- 
posed. With the proposed technique, the operation frequency of the receiver 
can be reduced for the time intervals during which timing synchronization is 
achieved. The architectural modifications needed for the application of the 
proposed technique are described. The proposed technique is applied to a number of 
FIR filters, which are part of every digital receiver, and the experimental results 
show that significant power savings are achieved with a very small area and 
critical path overhead. 
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Abstract. A framework for high-level power estimation dedicated to 
the design of signal processing architectures is presented in this work. 
A strong emphasis lies on the integration of the power estimation into 
the regular design-flow and on keeping the modeling overhead low. This 
was achieved through an object-oriented design of the estimation tool. 
Main features are an easy macromodule extension, the implementation 
of a Verilog HDL subset, and a moderate model complexity. The estimation 
errors obtained with the framework during the development of a discrete cosine 
transform are comparable to the deviation of power consumption imposed by 
its data dependency. 



1 Introduction 

The emerging market for wireless communication and mobile computing forces 
manufacturers and designers to pay special attention to low-power aspects of 
their products. Additionally, due to the short life cycle of these mobile devices, 
time-to-market has become a crucial factor. This leads to the requirement of 
power estimation on a high level of abstraction in the design process. 

In this paper, an object-oriented framework is presented which addresses 
the problem of finding a low-power realization among signal processing architecture 
alternatives in an early design phase while taking the time-to-market issue into 
account. A high-level power estimation tool built around a central macromodule 
database was developed which integrates into the normal design-flow. This was achieved 
by the implementation of a subset of the Verilog hardware description language 
in the power estimation tool. The design-time factor was addressed through an 
easy extensibility of the framework and a moderate model complexity. 

Approaches for high-level power estimation often use a two stage design-flow 
[1] - [4]: In a characterization phase, accurate simulations on gate- or 
transistor-level netlists for the implementation of a macromodel are performed. From these 
simulations, power consumption coefficients for a later power estimation phase 
are extracted. For an estimation on a higher level (RT-, architecture-level), 
signal characteristics on the module interfaces are taken and used to calculate the 
power consumption from the coefficients previously extracted during the 
characterization phase. All these methods have in common that a module or a class 
of similar modules has to be characterized once and can afterwards be used in 
the high-level design process as often as necessary. Adequate precision in the 
estimation process is ensured through the accurate characterization on a lower 
level of abstraction. 

* The work presented is supported by the German Research Foundation, Deutsche 
Forschungsgemeinschaft (DFG), within the research initiative “VIVA” under 
contract number PI 169/14. 

During the design process, the distinction between characterization and 
evaluation phase has one drawback: Each module has to be characterized once 
before its use. Therefore, the model creation and characterization of a new module 
should be easy and reasonably fast. Reuse of already created models is of great 
interest for a rapid development cycle. Furthermore, the power estimation should 
integrate easily into the design process. 

The rest of the paper gives a short review of two high-level power estimation 
models and a description of the implemented power estimation framework with 
macromodels exploiting the Dual Bit Type (DBT) method. As an example, a 
signal processing application is given and the results are discussed. 

2 Power Macromodeling 

For high-level power estimation with the two stages characterization and 
estimation, different modeling methods can be applied. One method treats 
macromodules as black boxes with no knowledge of the functionality and realization of 
the module [1]. Another approach is white box modeling, where an in-depth 
knowledge of the structure and functionality is required [2]. Other methods lie 
between both approaches and require a moderate knowledge of the module’s 
functionality [3]. 

As mentioned earlier, creation and characterization of new modules during 
the design process should require only moderate effort in model creation and 
simulation time. Therefore, simulation-intensive models and models that require an 
in-depth knowledge of structure and functionality may not be accepted 
by circuit designers in time-to-market critical projects. 

In [1], a probabilistic macromodel approach for combinational 
CMOS circuits is proposed. The model captures four input/output signal 
switching statistics, leading to a four dimensional look-up table for the 
estimation of power consumption. As the four indices result from continuous values, 
a discretization has to be performed prior to table look-up. With the chosen 
discretization, the table for one module comprises 10^4 entries. Filling this table 
during the characterization process is a time consuming task and takes hours 
to days to generate the coefficients for just one module. On the other hand, the 
independence of the characterization phase from the type of module is 
advantageous. 

The authors of [3] use an approach where number formats and word lengths 
are considered. They first reported the so-called Dual Bit Type (DBT) method, 
which especially aims at signal processing applications. The main idea behind 
the model is to identify regions of similar behaviour within the data words of 
module in- and outputs. One region behaves like the sign in a number 
representation. Another region of the word has random data characteristics. For modules 
modeled using the DBT method, up to 73 coefficients and an additional 
architecture parameter matrix have to be determined during the characterization. In 
[4], an approach to refine the estimation accuracy for macromodels with input 
signals differing from the DBT data model is given. It uses a training phase for 
model improvement. 

If a fast architecture exploration without the necessity to get accurate figures 
for the absolute power consumption is required, all overhead should be reduced 
to get a first idea of which architecture implementation is a candidate for the final 
design. Additionally, the influence of the processed data on the power 
consumption can be remarkably high. This in turn shows that there is often no need to 
achieve the highest possible estimation accuracy, as the power consumption of 
the application is not a single value but a fuzzy one. 

With the given constraints for a fast creation of new macromodels during 
the design phase, the DBT method is better suited than the pure probabilistic 
model in [1]. In the following, the model derived from the original DBT method 
and its application for rapid prototyping of video signal and image processing 
architectures is described. 

2.1 DBT Macromodel 

The chosen DBT macromodel is especially beneficial for signal processing 
architectures, which are built on numerical operations performed on data words. 
It is quite evident that the least significant bits (LSBs) in a data word tend to 
behave randomly, and can be described by a uniform white noise (UWN) 
process. In contrast, the most significant bits (MSBs), in two’s-complement number 
representation, correspond to sign bits. The signal and transition probabilities 
in the MSB region do not behave like a random process due to the temporal 
correlation of the data words. 

In Fig. 1, the decomposition of a data word in the dual bit type model into 
sign- and random-like data regions is shown. The high-order bits from the MSB 
down to breakpoint $BP_1$ form the sign region. The bits from the LSB up to $BP_0$ 
form the UWN region. The intermediate region between $BP_1$ and $BP_0$ can be modeled 
quite well by linear interpolation of the sign and UWN activity. 

Fig. 1. Decomposition of data word in sign and UWN region 

The definitions of the breakpoints $BP_1$ and $BP_0$ are given in (1) and (2); 
they can be computed from the word-level statistics variance ($\sigma^2$) and 
temporal correlation ($\rho$): 

$BP_1 = \log_2 (6\sigma)$    (1) 

$BP_0 = \log_2 \sigma + \log_2 \left[ \sqrt{1 - \rho^2} + |\rho|/8 \right]$    (2) 

Fig. 2. Overview of power estimation framework 

The definition of $BP_1$ found in [3] additionally uses the mean value ($\mu$), 
whereas the chosen definition (1) reflects the one reported in [5]. For modules 
with two or more inputs, the original method introduced misaligned 
breakpoints for the case where the breakpoints of the inputs differ. However, this 
distinction did not lead to a relevant model improvement. Omitting the misaligned 
breakpoints from the model results in a much faster characterization process. In 
practice, the characterization of a module with two inputs speeds up by a factor of 
approximately five. 

In the case of a module with two inputs a and b, the resulting breakpoints 
are calculated by equations (3) and (4), where $BP_{0,a}$ and $BP_{0,b}$ are the 
breakpoints $BP_0$ of inputs a and b, respectively, and $BP_{1,a}$ and $BP_{1,b}$ 
the corresponding breakpoints $BP_1$: 

$BP_0 = \max(BP_{0,a}, BP_{0,b})$    (3) 

$BP_1 = \max(BP_{1,a}, BP_{1,b})$    (4) 
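
The breakpoint computation can be sketched in a few lines of Python. This is an illustrative sketch, not the framework's implementation: the function names are hypothetical, (1) is read as BP1 = log2(6σ) and (2) as BP0 = log2 σ + log2[√(1 − ρ²) + |ρ|/8], and the breakpoints are kept as real numbers because the text does not state how they are rounded to bit positions.

```python
import math


def breakpoints(sigma, rho):
    """Return (BP0, BP1) of one signal from its word-level statistics.

    sigma: standard deviation of the data words (must be positive)
    rho:   temporal (lag-1) correlation, -1 <= rho <= 1
    """
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    bp1 = math.log2(6.0 * sigma)                                   # eq. (1)
    bp0 = math.log2(sigma) + math.log2(
        math.sqrt(1.0 - rho * rho) + abs(rho) / 8.0)               # eq. (2)
    return bp0, bp1


def combined_breakpoints(stats_a, stats_b):
    """Breakpoints of a two-input module, eqs. (3) and (4).

    stats_a, stats_b: (sigma, rho) tuples of inputs a and b.
    """
    bp0_a, bp1_a = breakpoints(*stats_a)
    bp0_b, bp1_b = breakpoints(*stats_b)
    return max(bp0_a, bp0_b), max(bp1_a, bp1_b)


if __name__ == "__main__":
    print(breakpoints(16.0, 0.0))
    print(combined_breakpoints((16.0, 0.9), (64.0, 0.5)))
```

Taking the maximum of both breakpoints replaces the misaligned-breakpoint handling of the original method with a single pair of values per module.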

3 Power Estimation Framework 

An overview of the high-level power estimation framework is shown in Fig. 2. 
Central elements of the framework are the module library and the power database. 
The module library consists of parameterizable high-level module descriptions 
required for the design of the architecture. In the power database, coefficients 
for the computation of the power consumption of each module are stored. While 
the module descriptions in the module library are technology independent, the 
power coefficients in the power database are especially computed for the target 
technology. 

The design of a signal processing architecture follows the usual design flow 
for high-level designs: In a first step the functional view of the architecture 
is created, where the circuit description is based on modules taken from the 
module library. In case a module is not yet stored in the library, it has to be 
functionally modeled and characterized for the power estimation process. Then, 
the design can be verified through functional simulations. The subsequent power 
estimation takes the high-level architecture description, performs a simulation 
with the same signals applied to the design during the functional verification, and 
calculates the power consumption. In addition, area estimates and timing delays 
based on the module’s worst case path are reported for architecture comparisons. 



3.1 Characterization 

Before using a module in the design process for the first time, it has to be 
characterized for determining the power coefficients for a given technology. In 
a first step, the module is implemented as a Verilog high-level description with 
parameterizable input and output port sizes and, if applicable, for parameter- 
izable implementation architectures. This high-level description is mapped for 
the required port sizes and implementation onto the target technology using a 
synthesis tool, e.g., the Synopsys Design Compiler. Both module views, the high- 
level and the technology-mapped, are stored in the module library for module 
characterization and for the verification of the circuit during the architecture 
design phase. 

The module characterization itself uses a gate-level power simulator. 
Simulation of the module is performed with specifically generated stimulus patterns. 
Characteristic power coefficients obtained from the simulation are stored in the 
power database. Area and delay timing estimates are directly taken from the 
technology mapping and also stored in the database. 



3.2 Estimation 

The estimation part of the framework reads the high-level architecture 
description previously created for the verification and simulates the architecture 
together with the input signals. The simulation is cycle accurate and tracks the 
signal statistics $\sigma^2$, $\rho$, and sign activity on the module interfaces. 

The power consumption of the architecture can be computed at any 
simulation time using the DBT model described in section 2.1. The power consumption 
$P_{Module}$ is computed from (5), where $P_U$ and $P_S(SS)$ are the power coefficients 
for the UWN region and the power coefficients for sign transitions, all taken 
from the power library. $N$ is the number of bits of the input, $N_U$ and $N_S$ are 
the number of bits for the UWN and sign region. 

$P_{Module} = \frac{N_U}{N} P_U + \frac{N_S}{N} \sum_{SS} P_S(SS)$    (5) 



The number of bits in the sign region $N_S$ is computed from (7) and includes 
one half of the bits found in the intermediate region $N_I$ (6). The other half 
of the linearly interpolated region is added to the number of UWN bits $N_U$ in 
(8): 

$N_I = BP_1 - BP_0 - 1$    (6) 

$N_S = (N - BP_1) + N_I/2$    (7) 

$N_U = (BP_0 + 1) + N_I/2$    (8) 
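
The partitioning of a word into its regions can be written down directly. The following Python sketch assumes the reading N_I = BP1 − BP0 − 1, N_S = (N − BP1) + N_I/2 and N_U = (BP0 + 1) + N_I/2 of (6) to (8); the function name is hypothetical and the breakpoints are passed in as already computed values.

```python
def region_sizes(n_bits, bp0, bp1):
    """Split an n_bits-wide word into UWN, intermediate and sign bit counts.

    Returns (n_u, n_i, n_s); by construction n_u + n_s equals n_bits,
    because each region receives half of the intermediate bits.
    """
    if bp1 < bp0:
        raise ValueError("bp1 must not be smaller than bp0")
    n_i = bp1 - bp0 - 1               # eq. (6): intermediate region
    n_s = (n_bits - bp1) + n_i / 2    # eq. (7): sign region
    n_u = (bp0 + 1) + n_i / 2         # eq. (8): UWN region
    return n_u, n_i, n_s


if __name__ == "__main__":
    n_u, n_i, n_s = region_sizes(16, 4, 9)
    print(n_u, n_i, n_s, n_u + n_s)
```

A useful sanity check on any implementation of these formulas is that the UWN and sign bit counts always add up to the word size N.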

The total estimated power consumption $P_{Total}$ for the architecture is the 
sum over all $P_{Module}$, where $M$ is the number of modules: 

$P_{Total} = \sum_{m=0}^{M-1} P_{Module}(m)$    (9) 



3.3 Implementation 

The implementation of the framework consists of the two parts characterization 
and estimation. 

Controlled by a set of scripts, the characterization process is semi-automatic: 
one or more modules are mapped onto the target technology and 
characterized in one run. The scripts control the generation of stimulus patterns and 
generate the testbench for the module, which is afterwards simulated using a 
Verilog based simulator. Power coefficients resulting from the simulation are 
extracted by the scripts and stored in the power database. 

This power database is implemented on a SQL (structured query language) 
database and allows for concurrent read and write accesses from several host 
computers. 

The power estimation part of the framework is implemented in Java, which 
gives several advantages compared to other programming languages: simple 
access to SQL databases, strong object-oriented programming and run-time 
linkage. To ease the creation of new modules, the estimation process is implemented 
using an object-oriented design. In Fig. 3, the object hierarchy of the 
macromodules is given. Through reuse of already existing components, the framework can 
be easily extended with additional modules. Because Java uses run-time linkage, 
the estimation framework need not be recompiled when adding new 
macromodules. Thus, a designer does not need access to the source code of the 
framework, but is still able to extend it. 

With the subset of Verilog HDL implemented in the estimation program, the 
same high-level description of the architecture can be used for the verification 
of the design using a Verilog simulator and its power consumption, area, and 
timing estimation. 

Fig. 3. Object hierarchy of macromodules 

4 Application 

As an example for a signal processing application, a one-dimensional discrete 
cosine transformation (DCT) was examined. The 8 × 1-DCT is the basis for a 
two-dimensional 8 × 8 cosine transform which is a major part of modern image 
and video compression schemes. 

As the DCT requires several numerical operations to calculate the transform, 
many different implementation alternatives were proposed in the past. The DCT 
architecture under investigation was reported by Zhang and Bergmann in [6]. 
Their variant requires 11 multiplications, 15 additions, and 14 subtractions to 
perform the necessary operations for a one-dimensional transform. 



5 Results 

This section discusses the results achieved with the implemented DBT 
macromodel. First, the model accuracy for stand-alone modules is given. Next, the 
influence of data on the power consumption of an application is investigated. 
Finally, results for the DCT signal processing application consisting of several 
macromodules are discussed. 

All simulations and computations were performed on a Sun Ultra 10 work- 
station with a 440MHz clocked processor and 512MB main memory. 



5.1 Characterization Times 

The time to perform the characterization of the macromodules is given in 
Table 1. For the modules “Add”, “Sub”, and “Mult” the range of the word size 
for which the module was characterized and the simulation time are shown. It can be 
seen that even to characterize the whole set of modules, the time to perform the 
characterization is low. Characterization was performed for a carry-look-ahead 
architecture of the adders and subtractors, whereas a carry-save-adder 
implementation was chosen for the multiplication. 



Table 1. Characterization times for macromodels 

Add                     Sub                     Mult 
word size   time        word size   time        word size   time 
4 ... 32    161 sec     4 ... 32    173 sec     4 ... 16    214 sec 

Table 2. Model accuracy for addition, subtraction, and multiplication macromodules 
depending on the input word size 

Add                     Sub                     Mult 
size   rme    rmse      size   rme    rmse      size   rme     rmse 
8      6.4%   7.3%      8      5.7%   6.9%      8      30.9%   30.7% 
12     5.1%   5.6%      12     2.6%   3.6%      10     19.9%   20.8% 
16     3.2%   3.6%      16     1.9%   2.7%      12     13.8%   14.8% 
20     2.5%   2.8%      20     1.7%   2.2%      14     10.9%   11.9% 
24     1.9%   2.1%      24     0.9%   1.5%      16     8.8%    9.7% 

Table 3. Simulation times for model accuracy verification 

Module   gate-level              high-level 
         total      mean         total      mean 
Add      11.5 hrs   3.6 sec      16.0 min   83.3 msec 
Sub      13.1 hrs   4.1 sec      16.0 min   83.3 msec 
Mult     46.2 hrs   14.4 sec     16.0 min   83.3 msec 



5.2 Model Accuracy 

To verify the model accuracy, single modules were simulated using gate-level 
netlists and the estimation framework. For each module, different experiments 
with different pseudo-random normal-distributed sequences and varying statis- 
tical characteristics and input word sizes were performed. 

The experiments used 5 different input word sizes, 4 variations of the mean, 
3 different standard deviations, and a set of 4 temporal correlation values. In 
total, each of the macromodels for addition, subtraction, and multiplication was 
simulated 11,520 times using 10,000 input patterns on each input port. 

In Tab. 2, results for the experiments concerning these three macromodules 
are given. For each module, the word size, relative mean error (rme), and relative 
mean square error (rmse) are shown. For the modules, the same implementation 
architecture was used as during the characterization. 

The simulation times for the model accuracy experiments are shown in Ta- 
ble 3. It took several hours up to days to simulate the experiment on gate-level 
netlists. The average for one simulation is in the order of seconds for the three 
types of modules. But it should be considered that the simulation time on gate- 
level netlists depends on the circuit complexity, thus modules with smaller word 
sizes simulate faster than the same module with a larger word size. In con- 
trast, the run-time of the high-level estimation framework is module and word 
size independent and results in 16 min simulation time for all runs. Thus, one 
simulation is completed in 83.3 msec. 
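
The figures in Table 3 can be cross-checked with a little arithmetic. The Python sketch below assumes that the statistical parameters (mean, standard deviation, temporal correlation) were varied independently on each of the two input ports, which is an interpretation that happens to reproduce the stated 11,520 runs; it is not confirmed by the text. The constant and function names are illustrative.

```python
WORD_SIZES = 5
MEANS = 4
STD_DEVS = 3
CORRELATIONS = 4
INPUT_PORTS = 2  # assumption: statistics varied independently per input port

# Parameter combinations per port and resulting number of simulation runs.
per_port = MEANS * STD_DEVS * CORRELATIONS
runs = WORD_SIZES * per_port ** INPUT_PORTS


def mean_run_time_seconds(total_seconds, run_count):
    """Average duration of one simulation run."""
    return total_seconds / run_count


# Total gate-level simulation times from Table 3, in hours.
GATE_LEVEL_HOURS = {"Add": 11.5, "Sub": 13.1, "Mult": 46.2}
gate_level_mean = {name: mean_run_time_seconds(hours * 3600.0, runs)
                   for name, hours in GATE_LEVEL_HOURS.items()}

# High-level estimation: 16 min in total for every module type.
high_level_mean_ms = mean_run_time_seconds(16.0 * 60.0, runs) * 1000.0

if __name__ == "__main__":
    print("runs:", runs)
    print({name: round(t, 1) for name, t in gate_level_mean.items()})
    print("high-level mean (ms):", round(high_level_mean_ms, 1))
```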



5.3 Data Influence 

Simulations on the gate-level netlist of the DCT with different input sequences 
were performed to analyze the effect of data dependency on the power 
consumption. These simulation results were obtained using the video test sequences 
“coastguard”, “news”, and “weather”. 

Table 4. Deviation of power consumption for gate-level simulation of DCT with 
different video sequences 

Add Sub Mult 

min    max     mean 
9.1%   35.5%   23.0% 

In Tab. 4, the deviation of power consumption is given. For addition, sub- 
traction, and multiplication the minimum, maximum, and mean deviations cal- 
culated by (10) are given. 

( 1 \na,y:.N sequence Pm oduiein^i sequence)) 

— / 7 IV 

M mvaiy sequence Pm oduiepn, sequence)) 

From these results, it can be seen that the influence of the processed data on 
the power consumption is remarkably high. Thus, a deviation of the estimated 
power for a certain data stream is tolerable. 

5.4 DCT 

Estimation accuracy results for the 1D-DCT are given in Tab. 5. For each of 
the modules, the name of the instance, the input word size, and the relative 
mean error (rme) are given. The module instance name is composed of the type 
of operation (e.g. “add”) followed by the stage number (1 ... 4) in the DCT and 
an additional identifier. Thus, the module “sub2_65” is a subtraction operation 
located in the second stage of the circuit. 

The mean error for all adders is 9.3%, for the subtracters 0.8%, and 19.5% for 
the multipliers. Compared to the deviations in power consumption for different 
video sequences shown in Tab. 4, the achieved accuracy for the DCT application 
is sufficient. 

Running a gate-level netlist simulation using the Verilog based power simula- 
tor takes 138 sec, whereas the high-level estimation tool takes 8 sec. A high-level 
simulation with the Verilog simulator needs 12 sec to complete. Thus, the power 
estimation framework compares well to functional simulation. 

Table 5. Estimation results for one-dimensional DCT 

Module     word              Module     word              Module      word 
instance   size   rme        instance   size   rme        instance    size   rme 
add1_07    9      -4.8%      sub1_07    9      -13.5%     mults2_4    10     -6.2% 
add1_16    9      -5.5%      sub1_16    9      -13.2%     mults2_7    10     0.4% 
add1_25    9      -5.7%      sub1_25    9      -13.4%     mults4_2    11     15.3% 
add1_34    9      -5.9%      sub1_34    9      -13.4%     mults4_23   12     -12.2% 
add2_03    10     -7.6%      sub2_03    10     -13.2%     mults4_3    11     -4.2% 
add2_12    10     -7.3%      sub2_12    10     -13.5%     mults4_4    16     0.5% 
add2_65    10     -15.0%     sub2_65    10     -15.8%     mults4_47   16     -5.8% 
add3_01    11     -8.8%      sub3_01    11     -13.6%     mults4_5    16     38.7% 
add3_45    18     12.3%      sub3_45    18     11.8%      mults4_56   16     -5.6% 
add3_76    18     13.7%      sub3_76    18     10.6%      mults4_6    16     12.1% 
add4_23    11     -13.3%     sub4_2     21     -8.4%      mults4_7    16     22.3% 
add4_47    19     -4.6%      sub4_3     21     -4.2% 
add4_56    19     -7.4%      sub4_4     26     -9.6% 
add4_6     26     -8.1%      sub4_5     26     -8.3% 
add4_7     26     -7.5% 

6 Conclusion 

This paper presented a high-level power estimation framework with emphasis 
on an easy integration into the design-flow for signal processing architectures. 
Through an object-oriented design of the framework and the reduction of 
unnecessary overhead in the model, extending the module library requires only 
moderate additional work and a short module characterization time during the 
design process. Simulations determining the model accuracy by comparing results 
obtained by time-consuming gate-level simulations to the power consumption 
estimated by the framework show reasonably good quality. The application of 
the power estimation to the design of a DCT architecture gives deviations from the 
gate-level simulation that are of the same order as the deviations imposed by 
applying different data sequences to the architecture. Hence, using the 
implemented modeling framework, it could be shown that the investigated approach 
is of sufficient accuracy. 
Abstract. In this paper, we describe a new encoding technique which
reduces bus line transition activity for power-efficient data transfer over 
wide system buses. The focus is on data streams whose statistical pa- 
rameters such as transition activity are either non-stationary or a priori 
unknown. The proposed encoding technique extends the Partial Busin- 
vert encoding method [1] with a dynamic selection of the bus lines to be 
encoded. In this work, we present the encoding algorithm and a low power 
implementation of a corresponding coder-decoder system. Experiments 
with real-life data streams yielded a reduction in transition activity of 
up to 42 % compared to the uncoded data stream. 



1 Introduction 

The minimization of on-chip power dissipation is nowadays a key issue in the de- 
sign of highly integrated electronic systems. There are two main reasons for this: 
First, the prolongation of operating time of battery powered mobile applications 
and second, the reduction of on-chip heat generation. 

The power dissipated on a clocked system bus of a CMOS circuit is approximated
by the following equation:
$P_V = \frac{1}{2} V_{dd}^2 f \sum_{i=0}^{n-1} C_{L,i}\,\alpha_i$,
where $n$ is the bus width, $f$ the bus clock frequency, $V_{dd}$ the operating
voltage, $C_{L,i}$ the parasitic capacitance and $\alpha_i$ the transition
activity of bus line $i$, respectively. Usually parasitic capacitances of bus
lines exceed module-internal capacitances by some orders of magnitude, therefore
up to 80 % of the total power dissipated on a chip is dissipated on system buses.
At higher levels of design abstraction the designer usually has no influence on
the choice of parameters such as operating voltage and bus clock frequency and
cannot affect intrinsic parasitic capacitances. In most cases the only parameter
in the equation given above that can be optimized at higher levels of design
abstraction is the transition activity.
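The power approximation above can be evaluated directly. The following is a minimal sketch, assuming per-line activities given as transitions per clock cycle and per-line capacitances in farads; the function name `bus_power` and the example values (32 lines, 10 pF each, activity 0.5) are illustrative assumptions and are not taken from the paper.

```python
def bus_power(alphas, caps, vdd, f):
    """Approximate bus power P_V = 1/2 * Vdd^2 * f * sum(C_L,i * alpha_i) in watts.

    alphas: transition activity of each bus line (transitions per clock cycle)
    caps:   parasitic capacitance of each bus line in farads
    """
    if len(alphas) != len(caps):
        raise ValueError("need one activity and one capacitance per bus line")
    return 0.5 * vdd ** 2 * f * sum(c * a for c, a in zip(caps, alphas))


# Hypothetical 32-bit bus: 10 pF per line, each line toggling in half of the cycles.
p_example = bus_power([0.5] * 32, [10e-12] * 32, vdd=2.5, f=50e6)
```

With these assumed values the estimate is about 25 mW. Only the activities in `alphas` can be influenced by bus encoding; the other arguments are fixed by technology and system constraints.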

In this work we present a new technique for system bus encoding in order to
minimize bus line transition activity. We refer to it as Adaptive Partial
Businvert Encoding (APBI). Our technique is based on the Businvert encoding
scheme [2]. We extend the method of Partial Businvert encoding published in [1]
with an adaptive component. Based on the statistics of the data stream observed
during system operation, APBI dynamically selects a subset of bus lines to be
encoded using the Businvert encoding scheme. Our encoding technique requires
one additional bus line, and data are transmitted over the bus each cycle (i.e.
we do not exploit spatial redundancy) with a delay of one clock cycle.

* This work is sponsored by the DFG within the VIVA research initiative.

In contrast to all static encoding schemes that have been published so far, 
the ability of our encoding technique to adapt to changing characteristics of
the transmitted data stream eliminates the necessity of a priori knowledge of its 
statistical parameters for selecting an appropriate encoding scheme. Therefore 
our method is especially suited for system buses that transport data streams 
with unknown or strongly time-varying distribution of transition activity. For 
such data streams our method yielded a reduction in transition activity of up to 
42 %. 

The paper is structured as follows: Section 2 gives an overview of related 
work and the motivation of this work. Some preliminaries are given in Sect. 
3. In Sect. 4 we describe the algorithm of APBI encoding. An efficient, power- 
optimized implementation of a corresponding coder-decoder system is given in 
Sect. 4.3. Experimental results are shown in Sect. 5. Section 6 summarizes the 
paper. 

2 Related Work and Motivation 

Different application-specific methods for system bus encoding have been
published that exploit the characteristics of the transmitted data stream. In
microprocessor systems, typical streams can be grouped into data and instruction
streams and address bus streams. The Businvert encoding scheme [2] is applicable
to both kinds of streams and minimizes the Hamming distance between the current
state of the bus and the following data word to be transmitted. If more than
half of the bits would change, the data word is inverted. An additional bus line
is used to signal to the data sink whether the word has been inverted. The T0
encoding scheme [3] exploits the high in-sequence portion of address bus data
streams generated by microprocessors. Consecutive addresses are transmitted by
setting an increment signal on an extra bus line, while at the same time freezing
the bus state. The data sink calculates the new address by adding a constant
increment value to the last address. The "Beach Solution" [4] uses a statistical
analysis of an application-specific address data stream, followed by the
generation of a transition-minimizing bus code. Combined encoding schemes
published in [5] optimize the encoding for different data and address streams
multiplexed over a single bus.

In the case of uncorrelated bus lines which have uniformly distributed switching
activity the Businvert method is optimal [2]. However, on real buses, switching
activity is often distributed in a non-uniform fashion or bus lines are spatially
correlated. In [1] it was shown that for these cases the performance of
Businvert encoding can be improved if bus lines that have a lower switching
activity and spatial correlation than other lines are excluded from encoding.
Encoding these lines would increase total transition activity rather than reduce
it. The corresponding published technique is called Partial Businvert Encoding
(PBI). Selecting k lines to be encoded out of an n-bit wide bus has a complexity
of $O(2^n)$. Therefore, for wide buses a heuristic approach of complexity $O(n)$
is described in [1] that selects a sub bus which includes lines of high
transition activity and high spatial correlation.

All encoding schemes mentioned so far presume knowledge of the statistics or
the characteristic nature of the data streams to be transmitted. For many
applications, benchmark data streams are not available for all possible
operating conditions, or the streams have non-stationary statistical parameters
such as a time-varying switching activity on the bus lines. For these cases
static encoding schemes are inefficient. Rather, encoding techniques are needed
that have the ability to automatically adapt to a priori unknown or time-varying
statistical parameters of data streams. An approach to bit-wise adaptive bus
encoding is published in [6]. We refer to it as IAEB (Implementation of Adaptive
Encoding presented by Benini et al.). Based on the analysis of the number of
state changes on a bus line in a sampling window of fixed size, the one of four
possible simple encoding schemes that minimizes average line activity is chosen.
Unfortunately, the power dissipation of the corresponding coder-decoder system
over-compensates the achieved reduction in transition activity. So, in this work
our focus is on a less power-consuming approach to adaptive encoding that yields
a higher effective reduction in power consumption.

3 Preliminaries 

For characterization of the efficiency of bus encoding schemes we define the
following equations: The efficiency $E_\alpha$ of an encoding scheme describes
the reduction of switching activity $\alpha$ on the bus. It is defined as follows:

$$E_\alpha = 1 - \frac{\alpha_{\mathrm{coded}}}{\alpha_{\mathrm{uncoded}}}; \qquad -\infty < E_\alpha < 1. \tag{1}$$

$E_\alpha$ describes the performance of the encoding algorithm and is independent
of implementational aspects such as the target technology. Because
implementations of coder-decoder systems dissipate power themselves, an effective
power reduction is only achieved after compensation of that portion of dissipated
power. This is illustrated with the following power-balance equation:

$$P_{V,\mathrm{uncoded}} = P_{V,\mathrm{coded}} + P_{V,\mathrm{Codec}} + P_{V,\mathrm{saved}} \tag{2}$$

where $P_{V,\mathrm{uncoded}}$ and $P_{V,\mathrm{coded}}$ are the power dissipated
on the uncoded and the coded bus, respectively, and $P_{V,\mathrm{Codec}}$
represents the power consumption of the coder-decoder system. We now define the
efficiency $E_P$ of an encoding scheme regarding the reduction of the power
dissipated on a bus by:

$$E_P = \frac{P_{V,\mathrm{saved}}}{P_{V,\mathrm{uncoded}}} = 1 - \frac{P_{V,\mathrm{coded}} + P_{V,\mathrm{Codec}}}{P_{V,\mathrm{uncoded}}}; \qquad -\infty < E_P < 1. \tag{3}$$
$E_P$ depends on the target technology the coder-decoder system is implemented
with. In order to effectively reduce the power dissipated on the bus and the
coder-decoder system, $E_P$ must have a value greater than 0. From (2), the
effective capacitance $C_{L,\mathrm{eff}}$ results, which is the average minimum
capacitance of a bus line for $E_P > 0$:

$$C_{L,i} > C_{L,\mathrm{eff}} = \frac{2\,P_{V,\mathrm{Codec}}}{V_{dd}^2\, f\, (\alpha_{\mathrm{uncoded}} - \alpha_{\mathrm{coded}})} \tag{4}$$
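The three figures of merit of this section can be computed as follows. This is a minimal sketch rather than the authors' tooling: the function names are illustrative, and `effective_capacitance` assumes that all bus lines share one average capacitance and that the activities are total transitions per clock cycle summed over all bus lines, so that the break-even condition follows from (2) together with the bus power approximation of Sect. 1.

```python
def coding_efficiency(alpha_coded, alpha_uncoded):
    """E_alpha = 1 - alpha_coded / alpha_uncoded, cf. (1)."""
    return 1.0 - alpha_coded / alpha_uncoded


def power_efficiency(p_uncoded, p_coded, p_codec):
    """E_P = 1 - (P_coded + P_codec) / P_uncoded, cf. (3)."""
    return 1.0 - (p_coded + p_codec) / p_uncoded


def effective_capacitance(p_codec, vdd, f, alpha_uncoded, alpha_coded):
    """Smallest average line capacitance (farads) for which E_P > 0.

    Assumption: activities are transitions per cycle summed over all lines.
    Returns infinity if the code does not reduce the activity at all.
    """
    saved_activity = alpha_uncoded - alpha_coded
    if saved_activity <= 0:
        return float("inf")
    return p_codec / (0.5 * vdd ** 2 * f * saved_activity)
```

For instance, with hypothetical values of a 31.70 mW codec at 2.5 V and 50 MHz that lowers the bus activity from 16 to 10 transitions per cycle, `effective_capacitance` returns roughly 34 pF; bus lines with a smaller capacitance would not benefit from the encoding.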



4 Adaptive Partial Businvert Encoding 

4.1 Overview 



For an n-bit system bus whose lines have uniformly distributed switching activity,
a one-probability of p = 0.5 and whose lines are not correlated, i.e. it
resembles an independent, identically distributed source (i.i.d. source), the
Businvert encoding method is optimal [2], and, as we have shown in [7], it has a
coding efficiency of $E_\alpha = 1 - \sum_{k=1} \ldots$ E.g., for a 32 bit bus a
reduction in switching activity of 12 % is achieved. The Businvert encoding
scheme is defined as follows



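$$\{X^t, INV\} = \begin{cases} \{\overline{Q^t}, 1\} & : W(Q^t \oplus X^{t-1}) > n/2 \\ \{Q^t, 0\} & : \text{else.} \end{cases} \tag{5}$$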



$Q^t$ and $X^t$ represent the uncoded and encoded data words, respectively, and
$W(x)$ is the weight (number of ones) of a binary vector. In real applications,
such as image processing systems, switching activity is usually distributed
over bus lines in a non-uniform fashion and bus lines are more or less
correlated. If the activity distribution and line correlation are known and
stationary, PBI combined with the heuristics for selection of the sub bus to be
encoded as described in [1] represents an efficient encoding technique. However,
if activity distribution and line correlation are unknown or non-stationary, the
algorithm for static selection of the sub bus for PBI cannot be applied or
yields an inefficient solution with respect to the resulting coding efficiency.
For these cases $E_\alpha$ can be improved if the choice of the sub bus to be
encoded is adapted to the statistical parameters in certain time intervals. This
can be interpreted as the dynamic adaptation of a coding mask
$mask(t) = \{m_0^t, m_1^t, \ldots, m_{n-1}^t \mid m_i^t \in \{0, 1\}\}$,
where $m_i^t = 1$ at time $t$ means that the i-th bus line is included in the
encoded subset of bus lines. Considering this, we can derive the APBI encoding
algorithm through extension of the Businvert encoding algorithm:

$$\{X^t, INV\} = \begin{cases} \{Q^t, 0\} & : W\big((Q^t \oplus X^{t-1}) \cdot mask(t)\big) \le W(mask(t))/2 \\ \{Q^t \oplus mask(t), 1\} & : \text{else.} \end{cases} \tag{6}$$
The bus data stream is then decoded in the following way:

$$Q^t = \begin{cases} X^t & : INV = 0 \\ X^t \oplus mask(t) & : INV = 1. \end{cases} \tag{7}$$
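The encoding and decoding rules can be illustrated with a small behavioural model. The sketch below is not the authors' VHDL implementation; it represents bus words as Python integers, and the names `weight`, `apbi_encode` and `apbi_decode` are illustrative. It assumes that the decision compares the number of changing masked lines against half the weight of the mask, that ties are transmitted without inversion, and that `prev_x` is the data word last driven onto the bus.

```python
def weight(x):
    """Number of ones in the binary representation of x."""
    return bin(x).count("1")


def apbi_encode(q, prev_x, mask):
    """Encode data word q given the previous bus word prev_x and the coding mask.

    Returns (x, inv): the word driven onto the bus and the value of the extra line.
    """
    changing = (q ^ prev_x) & mask          # transitions on the masked lines only
    if 2 * weight(changing) > weight(mask): # more than half of the masked lines toggle
        return q ^ mask, 1                  # invert exactly the masked lines
    return q, 0


def apbi_decode(x, inv, mask):
    """Recover the data word from the bus word x and the invert line."""
    return x ^ mask if inv else x
```

With an all-ones mask the model reduces to plain Businvert encoding: on an 8-bit bus, `apbi_encode(0xF1, 0x00, 0xFF)` returns `(0x0E, 1)`, whereas with the mask `0xF0` only the upper four lines are inverted and the result is `(0x01, 1)`.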



The Businvert encoding method is a special case of APBI with all mask bits
constantly set to $m_i^t = 1$. Figure 1 visualizes the concept of an Adaptive
Partial Businvert coder-decoder system. The encoder consists of a Businvert
coder, a mask computation logic and a bus line selection logic. The mask
computation logic calculates from the input data stream the encoding mask
mask(t), which is then used to select the bus lines to be encoded. The encoding
mask is also fed into the Businvert coder block because, according to (6), the
encoder output is a function of both the input data and the weight of the
encoding mask bit vector. The APBI decoder consists of a Businvert decoder, a
bus line selection logic and a mask computation logic which is the same as in
the encoder. In order to be able to decode the encoded bus data stream
correctly, the APBI decoder has to have knowledge of the encoding mask that was
used to encode the data. For that purpose, the mask computation logic in the
APBI decoder extracts the encoding mask from the decoded data. The mask value is
then used to select from the encoded bus all lines that were left uncoded and
combines them with the output of the Businvert decoder for all decoded lines in
order to obtain the decoded data. For correct extraction of the value of the
encoding mask in the decoder, both encoder and decoder have to use the same
initial mask (e.g. after system reset) and the same size of the sampling window
which is used for mask computation.

Fig. 1. Block Diagram of an APBI Coder-Decoder System

4.2 Encoding Mask Computation 

The coding efficiency $E_\alpha$ of APBI directly depends on the proper
selection of the coding mask mask(t). In general mask(t) is a function of the
switching activities $\alpha_i$, the spatial and temporal correlation
$\rho_{ij}$ of the bus lines and the total switching activity $\alpha_{tot}$ of
the uncoded bus. A hardware implementation for estimation of correlation
coefficients would be extremely costly, because for every pair of bus lines one
counter counting joint one-states would be required, which amounts to
$\binom{32}{2} = 496$ counters for a 32 bit bus. The implementation would be so
power-consuming that no effective reduction in power dissipation could be
achieved. For that reason, and because experiments showed that, compared to the
activities $\alpha_i$, the correlation coefficients $\rho_{ij}$ have a
negligible influence on the choice of the coding mask, line correlation is not
considered in the mask computation algorithm. We rather restrict ourselves to
determining the i-th mask bit from the switching activity $\alpha_i$ of the
corresponding bus line and the total switching activity $\alpha_{tot}$ of the
uncoded bus:

$$mask(t) = \{m_0^t, m_1^t, \ldots, m_{n-1}^t\} = \{F_0(\alpha_0, \alpha_{tot}), F_1(\alpha_1, \alpha_{tot}), \ldots, F_{n-1}(\alpha_{n-1}, \alpha_{tot})\} \tag{8}$$

where $n$ is the bus width. The choice of the functions $F_i$ is based on the
following considerations: The average switching activity per line can be
calculated by $\bar{\alpha} = \alpha_{tot}/n$. Having an i.i.d. source,
switching activity is nearly identically distributed over all bus lines, so
$\bar{\alpha} \approx \alpha_i$. For an i.i.d. source the Businvert encoding
method is optimal, that is, all lines should be included in the encoding to
achieve the best reduction in transitions. If switching activity is
non-uniformly distributed over bus lines, $\bar{\alpha}$ serves as the boundary
to decide whether a bus line is included in the encoding or not. So we
determined the following functions $F_i$:

$$m_i^t = F_i(\alpha_i, \alpha_{tot}) = \begin{cases} 1 & : \alpha_i > \alpha_{tot}/n \\ 0 & : \alpha_i \le \alpha_{tot}/n. \end{cases} \tag{9}$$



The values for $\alpha_i$ and $\alpha_{tot}$ are determined by windowing the
data stream using N samples of input data per window and counting the
transitions within each window. At the end of every window the new coding mask
is calculated from $\alpha_i$ and $\alpha_{tot}$. There is a tradeoff between
the window size N and the accuracy of $\alpha_i$ and $\alpha_{tot}$. These
values become more accurate with larger windows. On the other hand, a larger
window size results in higher implementation costs for the coder-decoder system
and increases its response time to a change in the statistical parameters of
the input data stream. Experiments showed that a window size of N = 32 is a good
compromise.
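The window-based mask selection can be sketched in a few lines. This is an illustrative software model under stated assumptions, not the paper's hardware: `compute_mask` and its parameter `prev_word` (the word preceding the window) are hypothetical names, words are plain integers, and the comparison with the per-line average is done by cross-multiplication to avoid a division.

```python
def compute_mask(window, n, prev_word=0):
    """Return the coding mask for one window of uncoded n-bit data words.

    Bit i of the result is 1 if line i toggled more often than the
    per-line average alpha_tot / n within the window.
    """
    counts = [0] * n                 # per-line transition counts (alpha_i)
    last = prev_word
    for word in window:
        diff = word ^ last           # lines that toggle between consecutive words
        for i in range(n):
            counts[i] += (diff >> i) & 1
        last = word
    total = sum(counts)              # total transition count (alpha_tot)
    mask = 0
    for i, count in enumerate(counts):
        if count * n > total:        # alpha_i > alpha_tot / n
            mask |= 1 << i
    return mask
```

For the 4-bit window `[0b0001, 0b0000, 0b0001, 0b0011]`, line 0 toggles three times and line 1 once, so only line 0 exceeds the average of one transition per line and the resulting mask is `0b0001`.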



4.3 Implementation 

Due to limited space, we will not show the implementation of the BI coder and
decoder blocks in the APBI coder-decoder system (these can be found e.g. in
[2]) but restrict ourselves to presenting a power-efficient implementation of
the mask computation logic. An efficient implementation according to (9) is
shown in the block diagram in Fig. 2. The selection algorithm requires the
determination of the switching activities of each uncoded bus line as well as
the total transition activity of the uncoded bus. Transitions are detected by
XORing the current and the last data word, which is stored in an n-bit register.
The resulting signal serves as count enable for the bit line transition counters
$TC_0 \ldots TC_{n-1}$. At the same time the weight function Weight computes the
number of total transitions between two consecutive data words, which is added
up in the Total Transition Accumulator (TTA). In order to reduce glitches, the
weight function block is implemented with a balanced Carry-Save Adder tree. At
the end of each window the window counter (WC) produces an update signal for the
registers that store the current coding mask. The new coding mask is calculated
by the functions $F_0 \ldots F_{n-1}$, which compare, according to (9), the
counter results $TC_0 \ldots TC_{n-1}$ with the contents of TTA divided by the
bus width. In order to simplify the division operation, we restrict ourselves to
bus widths that are a power of 2, so the division can be replaced by a much
simpler shift by $\log_2 n$ operation. The resulting encoding mask is then
stored in the mask registers. In order to minimize the power dissipation of the
APBI coder-decoder system we integrated a feature which allows the mask update
interval to be increased from every window up to every k-th window. The mask
computation logic is not required to be active for the windows $0 \ldots k-1$.
During this time its clock (MC clock) is turned off, which is realized by the
Mask Update Counter (MUC) and the clock gate consisting of a latch and an
AND gate. The power dissipation of the mask computation logic could be further
minimized by isolating all major asynchronous logic blocks such as the selection
functions $F_i$ and the weight computation logic during cycles of inactivity.

Fig. 2. Efficient Implementation of the Mask Computation Logic
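A cycle-level software model can make the interplay of the counters, the shift-based threshold and the mask update interval easier to follow. The sketch below is a behavioural approximation under several assumptions that are not stated in the paper: the class name `MaskUnit` and its attributes are hypothetical, the initial mask is all ones, statistics are collected only in the last window of each group of k windows, and the last-word register keeps updating while the counters are idle.

```python
class MaskUnit:
    """Behavioural model of the mask computation logic (illustrative only)."""

    def __init__(self, n, window=32, update_interval=1):
        if n <= 0 or n & (n - 1):
            raise ValueError("bus width must be a power of two")
        self.n = n
        self.shift = n.bit_length() - 1   # log2(n), replaces the division by n
        self.window = window              # samples per window (N)
        self.k = update_interval          # mask is updated every k-th window
        self.mask = (1 << n) - 1          # assumed initial mask: all lines encoded
        self.tc = [0] * n                 # bit line transition counters TC_i
        self.tta = 0                      # total transition accumulator
        self.last = 0                     # last data word register
        self.samples = 0                  # window counter (WC)
        self.windows = 0                  # mask update counter (MUC)

    def step(self, word):
        """Process one uncoded data word and return the current coding mask."""
        active = self.windows % self.k == self.k - 1
        if active:                        # counters are clocked only in active windows
            diff = word ^ self.last
            for i in range(self.n):
                self.tc[i] += (diff >> i) & 1
            self.tta += bin(diff).count("1")
        self.last = word
        self.samples += 1
        if self.samples == self.window:   # end of window
            if active:
                threshold = self.tta >> self.shift
                self.mask = sum(1 << i for i, c in enumerate(self.tc) if c > threshold)
                self.tc = [0] * self.n
                self.tta = 0
            self.samples = 0
            self.windows += 1
        return self.mask
```

Because the counters hold integers, comparing a counter with the shifted accumulator gives the same decision as comparing it with the exact quotient. An encoder and a decoder model that are fed the same uncoded words and start from the same state produce identical masks, which mirrors the synchronization requirement of Sect. 4.1.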
5 Experimental Results

The proposed APBI coder-decoder system has been implemented as a synthe- 
sizable VHDL model for a bus width of 32 bits. With this model, switching 
activities of coded and uncoded buses for the following set of test data streams 
have been measured by simulation: 

— gen: A random, segmented data stream, generated with Mathematica®, with 
varying distribution of switching activity over the bus lines in every segment 

— ascii: An ASCII file in EPS format 

— binary: Example for an executable file (gzip binary) 

— image: 4 different concatenated images with varying characteristics in PPM 
format 

— noise: White Gaussian noise 

For all APBI simulations a window size of 32 samples was used. The encoding
mask was updated in intervals of 1, 2, 4, 8 or 16 windows. APBI has been
compared with BI, PBI and IAEB encoding, because BI and IAEB are the only
encoding schemes which do not require any a priori knowledge of the statistics
of the unencoded data stream. PBI was chosen since our method is derived from
this encoding scheme. The mask for PBI was separately optimized for every test
case using the proposed bus line selection heuristics [1].



Table 1. Relative Reduction in Switching Activity Regarding T_uncoded

                    T_coded
Sequence  T_uncod.  APBI_32,1  APBI_32,2  APBI_32,4  APBI_32,8  APBI_32,16  BI       PBI      IAEB
gen       2130360   37.05 %    41.96 %    41.95 %    41.99 %    42.00 %     12.70 %  12.78 %  44.08 %
ascii     221309    10.57 %    10.55 %    10.51 %    10.43 %    10.18 %     4.49 %   11.43 %  11.73 %
binary    154620    8.78 %     8.58 %     8.00 %     6.98 %     6.60 %      5.53 %   7.75 %   21.17 %
image     2878651   12.26 %    11.72 %    11.39 %    11.15 %    11.60 %     4.42 %   10.08 %  -9.37 %
noise     4086760   7.08 %     7.08 %     7.11 %     7.14 %     7.13 %      11.15 %  11.15 %  0.45 %
Average Reduction   15.15 %    15.98 %    15.79 %    15.54 %    15.50 %     7.66 %   10.64 %  13.61 %



Table 1 presents the reduction of transition activity on the coded bus in
percent, compared to the unencoded bus. As expected, APBI gave the best
reduction in transitions for the gen data stream. For binary and image APBI
outperforms BI and PBI, while it is slightly less effective for ascii. Compared
to IAEB, APBI yielded a higher reduction for image and noise. For the other
test streams it achieved less reduction in transition activity. The noise
example shows that BI is optimal for an i.i.d. source and cannot be outperformed
by any other encoding scheme. But on average APBI has a higher reduction in
transition activity than every other investigated scheme.

In a second experiment we determined the power dissipation for implementations
of BI, PBI, IAEB and APBI coder-decoder systems using our test suite of data
streams. For that purpose the VHDL models have been synthesized with Synopsys
Design Compiler for Fujitsu CE71 technology because it was the only available
library with cells characterized for internal power. Other libraries which were
available for our experiments, such as LSI10k or XILINX XC4000, did not have
that feature, so the measures of the power dissipation of the coder-decoder
systems would in general be too low. The switching activities of all internal
nodes in the resulting netlists were determined by simulation with the test data
streams using timing annotated VITAL models. Table 2 lists the resulting power
dissipation at f = 50 MHz and V_dd = 2.5 V determined with Synopsys Design
Power, and Table 3 lists the area and critical paths of the implementations of
the corresponding coder-decoder systems. It has to be pointed out that these
figures completely depend on the target technology the coder-decoder systems
are implemented with. Using other technologies may possibly result in a lower
power dissipation. Finally, Table 4 shows the effective capacitances C_eff
calculated according to (4) for the investigated coder-decoder implementations.

Table 2. Power Dissipation of Coder-Decoder Systems in Fujitsu CE71 Technology

          P_V,Codec
Sequence  APBI_32,1  APBI_32,2  APBI_32,4  APBI_32,8  APBI_32,16  BI       PBI      IAEB
gen       31.70 mW   19.64 mW   13.33 mW   10.22 mW   8.66 mW     5.52 mW  4.72 mW  47.31 mW
ascii     28.96 mW   18.64 mW   12.80 mW   9.96 mW    8.58 mW     4.76 mW  4.05 mW  48.58 mW
binary    24.86 mW   16.48 mW   11.28 mW   8.65 mW    7.40 mW     3.83 mW  2.08 mW  48.17 mW
image     18.75 mW   13.39 mW   9.15 mW    7.06 mW    6.04 mW     2.62 mW  1.58 mW  48.76 mW
noise     33.03 mW   20.19 mW   13.69 mW   10.47 mW   8.88 mW     5.25 mW  5.25 mW  49.25 mW



Table 3. Area and Delay for Implementations of Coder-Decoder Systems

Measure             APBI_32,1  APBI_32,2  APBI_32,4  APBI_32,8  APBI_32,16  BI     PBI    IAEB
Critical Path (ns)  10.4       10.7       10.8       10.8       10.8        7.2    5.7    3.1
Area (BC)           16125      17783      17829      17869      17939       1343   1177   29173



Table 4. Effective Average Capacitances C_eff

                    C_eff
Seq       T_uncod.  APBI_32,1  APBI_32,2  APBI_32,4  APBI_32,8  APBI_32,16  BI        PBI       IAEB
gen       2130360   33.69 pF   18.43 pF   12.51 pF   9.58 pF    8.12 pF     17.11 pF  14.54 pF  42.27 pF
ascii     221309    134.53 pF  86.73 pF   59.79 pF   46.89 pF   41.39 pF    52.07 pF  17.40 pF  203.44 pF
binary    154620    177.54 pF  120.48 pF  88.39 pF   77.76 pF   70.31 pF    43.47 pF  16.85 pF  142.75 pF
image     2878651   163.33 pF  122.01 pF  85.76 pF   67.62 pF   55.58 pF    63.29 pF  16.74 pF  -
noise     4086760   186.97 pF  114.39 pF  77.17 pF   58.81 pF   49.93 pF    18.87 pF  18.87 pF  4370 pF
6 Conclusions

The high efficiency of the APBI encoding technique for system buses with a
strongly time-varying activity profile could be demonstrated through the
experimental results. In contrast to most static encoding schemes such as PBI,
which only have a good encoding performance for streams they are explicitly
optimized for, APBI has the ability to adapt to a changing activity profile of
the data stream to be transferred. While IAEB achieves a higher reduction in
switching activity for particular data streams, on average APBI outperformed all
other investigated encoding schemes regarding reduction of transition activity
or coding efficiency $E_\alpha$. In all test cases APBI coder-decoder
implementations had a lower power dissipation than their IAEB counterparts. The
resulting effective capacitances C_eff show that the partly higher reductions
in switching activity achieved by IAEB cannot compensate for the higher power
dissipation of the coder-decoder system. Tolerating a slight deterioration in
coding efficiency $E_\alpha$, the power dissipation of the APBI coder-decoder
system can be further reduced by enlarging the mask update interval. Update
intervals of 2, 4 and 8 gave an acceptable reduction in switching activity with
substantially reduced power dissipation of the coder-decoder system. Our coding
scheme can be applied to highly capacitive system buses, e.g. bus lines which
cross chip boundaries, whose activity profile changes heavily over time or is a
priori unknown.
Abstract. A new probabilistic method to estimate the switching activity of a
logic circuit under a real delay gate model is introduced. Based on Markov
stochastic processes and generalizing the basic concepts of zero delay-based
methods, a new probabilistic model to estimate the power consumption accurately
is developed. More specifically, a set of new formulas, which describe the
temporal and spatial correlation in terms of the associated zero delay-based
parameters under the real delay model, is derived. The chosen gate model allows
accurate estimation of the functional and spurious (glitch) transitions,
leading to accurate power estimation. A comparative study and analysis of
benchmark circuits demonstrates the accuracy of the proposed method.



1 Introduction 

Power dissipation is recognized as a critical parameter in modern VLSI design. Thus, 
efficient low power design techniques have been developed to solve certain issues at 
all design levels [1]. Also, a number of power estimation methods for combinational 
logic circuits have been developed [2]. Recently, a number of probabilistic estimation
methods, considering zero gate delay model [3,4,5] and real gate delay model [6,7], 
were proposed. The method presented in [5] is the most accurate assuming zero-delay 
gate model since all types of correlations among the circuit signals are considered. 
The temporal correlation was captured by modelling the behaviour of a signal as a 
two state Markovian stochastic process, while the spatial correlation by the 
introduction of the concepts of the spatiotemporal transition correlation coefficient 
and the signal isotropy. 

Assuming an arbitrary gate delay model, a few probabilistic power estimation
methods have been published [6,7]. In [6], a symbolic simulation algorithm has
been proposed. Given the switching activities of the primary inputs and using
OBDDs, the transition probability of a node at time t results from XORing the
Boolean functions that correspond to two successive switching time instances,
i.e. t and t+1. The structural and the first-order temporal correlations are
handled, but the input pattern dependency is not captured, since the primary
inputs are assumed uncorrelated. Based on the signal probability calculation
method of [8], a new method for calculating the transition probabilities was
proposed in [7]. To manipulate large circuits, an efficient methodology, which
reduces the support set of an internal circuit node, has been developed. This
method is parameterised in terms of the depth of the circuit levels. The
structural and the first-order temporal correlations were captured, but the
primary inputs were considered uncorrelated.

Considering simultaneous input transitions as well as structural, temporal and
input pattern dependencies, we propose a probabilistic method for accurate power
estimation of combinational circuits assuming a real-delay gate model. It is
proved that the switching
activity estimation under real-delay gate model is transformed to switching activity 
estimation at specific time instances assuming zero-delay gate model. Based on the 
concepts of the transition probability (i.e. temporal correlation) and transition 
correlation coefficient [5] (i.e. spatial correlation), new formulas for calculating the 
transition probabilities for different time sub-intervals and the transition correlation 
coefficients of any pair of signals in terms of two time instances are proved. To 
describe the logic behaviour of a circuit node in terms of time, we adopt the notion of 
Timed Boolean Functions (TBFs) [9]. A TBF can be seen as a modified Boolean 
function, exhibiting all those properties that can model efficiently the behaviour of a 
circuit node for every time point. Manipulation of TBFs can be done by the TBF- 
Ordered Binary Decision Diagrams (TBF-OBDDs) [9], which have the inherent 
important property to solve the problem of temporal compatibility [7]. 

The rest of the paper is organized as follows. In Section 2 the problem is
formulated, while in Section 3 the mathematical model is given. The main principles 
of Timed Boolean Functions and TBF-OBDDs are presented in Section 4, while the 
procedure for the switching activity evaluation is given in Section 5. In Section 6 the 
experimental results prove the efficiency of the proposed method. Finally, the 
conclusions are presented in Section 7. 



2 Problem Formulation 

The power estimation problem of a combinational logic circuit under a real gate
delay model can be stated as:

“Given the gate level description of a combinational circuit with n inputs and m
outputs and the inertial delays of its gates, and assuming that the time between
two successively applied input vectors is greater than or equal to the settling
time of the circuit, estimate the average power consumption of the circuit for
an input vector stream through the calculation of its average switching
activity.”

It is assumed that the combinational circuit is a part of a synchronous sequential 
circuit, which means that its inputs can switch synchronously with the clock, 
performing at most one transition at time t=0 during the clock period [0,T). Moreover, 
an applied input signal is considered as an ideal step pulse without any voltage drops 
at circuit nodes, while its width is greater than or equal to the period time T.

To clarify the proposed method and the introduced concepts, a running example is 
used throughout the whole paper. 

Example: We assume a logic circuit with gate delays equal to one, shown in
Figure 1. The logic behaviour of the node f can be described in the time domain
as follows:

$$f = F(x_1, x_2, x_3, t) = x_1(t-2)\,x_2(t-2)\,x_3(t-1). \tag{1}$$

Fig. 1. A logic circuit with Unit Delay AND gates

The signal f may switch at two time instances, i.e. $t_1^f = 1$ and $t_2^f = 2$.
The transition of the signal f at $t_1^f = 1$ depends on the transitions of the
primary inputs $x_1$, $x_2$ and $x_3$ at time points $t^{x_1} = -1$,
$t^{x_2} = -1$ and $t^{x_3} = 0$, while the transition of f at $t_2^f = 2$
depends on the transitions of the signals $x_1$, $x_2$ and $x_3$ at
$t^{x_1} = 0$, $t^{x_2} = 0$ and $t^{x_3} = 1$. The corresponding logic
functions of f derived by (1) at $t_1^f = 1$ and $t_2^f = 2$ are:
$f_1 = F(x_1, x_2, x_3, 1) = x_1(-1)\,x_2(-1)\,x_3(0)$ and
$f_2 = F(x_1, x_2, x_3, 2) = x_1(0)\,x_2(0)\,x_3(1)$.

From the above example it follows that switching activity estimation under a real-delay
gate model is transformed into switching activity estimation at multiple time
instances. Also, the logic behavior of signal f at any time instance is described by the
modified logic function of eq. (1). However, at each time instance the modified
function is reduced to an ordinary Boolean function, whose Boolean variables are
the logic values of the input signals at specific time points.

Given the probabilistic properties (e.g. the switching activity) of its variables, and
manipulating the modified function efficiently, the switching activity at any time point
can be evaluated in a manner similar to zero-delay methods.

Taking eq. (1) as a starting point, a new mathematical model, which describes the
behavior of a logic signal in terms of time and signal correlation, should be
introduced. We aim to develop a new method that reduces the power estimation
problem under a real delay model to a zero-delay problem at certain switching
time points. For that purpose, we introduce new formulas that express the parameters
of the real-delay power estimation problem in terms of zero-delay parameters.



3 Mathematical Model 

The behavior of a binary signal, x, at a time point, t, i.e. x(t), is modelled as a random
variable of a time-homogeneous, strict-sense stationary, lag-one Markov stochastic
process having two states, s, with $s \in S = \{0, 1\}$.

The transition probability, $p^{x}_{kl}(t)$, expresses the probability of a signal x to
perform a transition from the state k to the state l within two successive time points
(q-1)T and qT, where q is an integer and T is the period of the input signals, and is
defined by:

$$p^{x}_{kl}(t) = p\big(x((q-1)T) = k \wedge x(qT) = l\big), \quad \forall k, l \in S. \qquad (2)$$

The switching activity, $E^{x}(qT)$, of a signal x at time instance qT is given by:

$$E^{x}(t) = p^{x}_{01}(t) + p^{x}_{10}(t). \qquad (3)$$
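
As a concrete illustration of eqs. (2) and (3), the following Python sketch estimates
the four transition probabilities of a signal from a sampled bit stream (one sample per
clock period) and sums the two toggling ones. The function names and the sample stream
are assumptions made for this example only.

```python
from collections import Counter


def transition_probabilities(bits):
    """Estimate p_kl for k, l in {0, 1} from consecutive samples of one signal."""
    pairs = Counter(zip(bits[:-1], bits[1:]))
    total = len(bits) - 1
    return {(k, l): pairs[(k, l)] / total for k in (0, 1) for l in (0, 1)}


def switching_activity(p):
    """Eq. (3): E = p_01 + p_10."""
    return p[(0, 1)] + p[(1, 0)]


if __name__ == "__main__":
    stream = [0, 1, 1, 0, 1, 0, 0, 1, 1]  # hypothetical input stream
    p = transition_probabilities(stream)
    print(p)
    print(switching_activity(p))  # 0.625 (5 toggles in 8 clock periods)
```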

The above stochastic process models the behavior of an input signal at times t=0,
t=T, etc., where the input signal may perform a transition. However, as shown in
the example of Figure 1, the transition probabilities $p^{x}_{kl}(t)$ of an input
signal x at several time points t=±d, $d \in \{1, 2\}$, are needed. More specifically, the
$p^{x_2}_{kl}(0)$, $p^{x_2}_{kl}(-1)$, and $p^{x_2}_{kl}(1)$ for the signal $x_2$ are needed.
However, the transition probabilities of an input signal at any time point t=±d are
constant, since the signal may perform a transition at t=0 only.

We introduce the notion of transition probabilities of an input signal x in the time
intervals (-T, 0) and (0, T) as $p^{x}_{kl}(0^-)$ and $p^{x}_{kl}(0^+)$, respectively. Their
corresponding values can be computed by the next lemma.



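Lemma 1. The transition probability of a primary input signal, x, in the time intervals
(-T, 0) and (0, T) is expressed with respect to the transition probabilities at t=0 as:

$$p^{x}_{kk}(0^-) = p^{x}_{kk}(0) + p^{x}_{k(1-k)}(0) \quad \forall k \in S \qquad (4)$$

$$p^{x}_{ll}(0^+) = p^{x}_{ll}(0) + p^{x}_{(1-l)l}(0) \quad \forall l \in S \qquad (4.1)$$

$$p^{x}_{kl}(0^-) = p^{x}_{kl}(0^+) = 0 \quad \forall k, l \in S \wedge k \neq l. \qquad (4.2)$$

Proof: Due to lack of space the proof is omitted here.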
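
A small numerical sketch may help to read Lemma 1. It assumes the interpretation that,
within (-T, 0), the input simply holds the value it had before its transition at t=0,
and within (0, T) it holds the value it has afterwards, so only the "no change" entries
are non-zero. The function name and the sample probabilities are hypothetical.

```python
S = (0, 1)


def interval_probabilities(p0):
    """Given p0[(k, l)] = p_kl(0), return (p_minus, p_plus) for (-T, 0) and (0, T)."""
    p_minus = {(k, l): 0.0 for k in S for l in S}  # eq. (4.2): off-diagonal entries are 0
    p_plus = {(k, l): 0.0 for k in S for l in S}
    for k in S:
        # Eq. (4): the signal holds k before t=0, whatever it does at t=0.
        p_minus[(k, k)] = p0[(k, k)] + p0[(k, 1 - k)]
        # Eq. (4.1): the signal holds k after t=0, whatever it was before.
        p_plus[(k, k)] = p0[(k, k)] + p0[(1 - k, k)]
    return p_minus, p_plus


if __name__ == "__main__":
    p0 = {(0, 0): 0.125, (0, 1): 0.375, (1, 0): 0.25, (1, 1): 0.25}
    p_minus, p_plus = interval_probabilities(p0)
    print(p_minus[(0, 0)], p_minus[(1, 1)])  # 0.5 0.5
    print(p_plus[(0, 0)], p_plus[(1, 1)])    # 0.375 0.625
```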

Definition 1. A Signal Transition Probability Set, $P^{x}(t)$, of a signal x at a time
instance t, is defined as the set of all transition probabilities $p^{x}_{kl}(t)$, where
$k, l \in S$:

$$P^{x}(t) = \{p^{x}_{00}(t),\; p^{x}_{01}(t),\; p^{x}_{10}(t),\; p^{x}_{11}(t)\}. \qquad (5)$$

Accurate power estimation requires that the spatial correlation among the
circuit signals be considered. Let $x_1$ and $x_2$ be two signals. The corresponding
stochastic machine has four states, which are the four combinations of the signal
values of $x_1$ and $x_2$. Based on this stochastic machine, it has been proved in [5] that
the spatial correlation can be captured by the Transition Correlation Coefficient (TC).
Assuming a zero-delay model, the TC between two signals $x_1$ and $x_2$ is [5]:

$$TC^{x_1 x_2}_{kl,mn} = \frac{p\big(x_1(t-1) = k \wedge x_1(t) = l \wedge x_2(t-1) = m \wedge x_2(t) = n\big)}{p\big(x_1(t-1) = k \wedge x_1(t) = l\big)\; p\big(x_2(t-1) = m \wedge x_2(t) = n\big)}. \qquad (6)$$

Since we use a real delay gate model, the notion of the TC [5] should be
generalized to capture the spatial correlation of two signals at any two
time instances. Thus, under a non-zero delay model, we define the Generalized
Transition Correlation Coefficient.



Definition 2. The Generalized Transition Correlation Coefficient, $TC^{x_1 x_2}_{kl,mn}(t_1, t_2)$,
between two signals $x_1$ and $x_2$, which perform a transition from the states k and m to l
and n, at times $t_1$ and $t_2$, respectively, is defined as:

$$TC^{x_1 x_2}_{kl,mn}(t_1, t_2) = \frac{p\big(x_1(t_1-1) = k \wedge x_1(t_1) = l \wedge x_2(t_2-1) = m \wedge x_2(t_2) = n\big)}{p\big(x_1(t_1-1) = k \wedge x_1(t_1) = l\big)\; p\big(x_2(t_2-1) = m \wedge x_2(t_2) = n\big)} \qquad (7)$$

where $k, l, m, n \in S$.
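
The definition can be made concrete by estimating the coefficients from two sampled bit
streams. The sketch below is an assumption-laden illustration, not the authors'
implementation: it treats one sample as one time unit, and the parameter `lag` (a name
introduced here) stands for the offset $t_2 - t_1 \geq 0$ between the two observation
times. With `lag=0` it reduces to the zero-delay TC of [5].

```python
from collections import Counter


def tc_coefficients(x1, x2, lag=0):
    """Estimate TC_{kl,mn}(t1, t2) with t2 - t1 = lag from two equal-length bit streams.

    Returns a dict keyed by (k, l, m, n) containing only the combinations observed.
    """
    n = len(x1) - 1 - lag
    joint, m1, m2 = Counter(), Counter(), Counter()
    for t in range(1, n + 1):
        a = (x1[t - 1], x1[t])              # transition k -> l of x1 at time t
        b = (x2[t + lag - 1], x2[t + lag])  # transition m -> n of x2 at time t + lag
        joint[a + b] += 1
        m1[a] += 1
        m2[b] += 1
    return {
        key: (count / n) / ((m1[key[:2]] / n) * (m2[key[2:]] / n))
        for key, count in joint.items()
    }


if __name__ == "__main__":
    x = [0, 1, 1, 0, 1, 0, 0, 1, 1]
    tc = tc_coefficients(x, x)  # a signal is perfectly correlated with itself
    print(tc[(0, 1, 0, 1)])     # 1 / p_01, about 2.667 for this stream
```

A coefficient of 1 indicates that the two transitions are independent; values above or
below 1 indicate positive or negative correlation, respectively.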

The spatial dependencies among three or more signals are captured approximately by the
pairwise TCs. For example, the TC of $x_1$, $x_2$ and $x_3$ can be expressed as:

$$TC^{x_1 x_2 x_3}_{kl,mn,pq}(t_1, t_2, t_3) = TC^{x_1 x_2}_{kl,mn}(t_1, t_2)\; TC^{x_1 x_3}_{kl,pq}(t_1, t_3)\; TC^{x_2 x_3}_{mn,pq}(t_2, t_3) \qquad (8)$$

where $k, l, m, n, p, q \in S$.

Since we deal with three time instances, i.e. $t = 0^-$, $t = 0$, and $t = 0^+$ (where
$0^-$ / $0^+$ denotes the time interval (-T,0) / (0,T)), appropriate TCs between two signals
$x_1$ and $x_2$ for capturing their spatiotemporal dependency should be determined.



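Definition 3. The Transition Correlation Coefficient Set, $TC^{x_1 x_2}(t_1, t_2)$, between
two signals $x_1$ and $x_2$ at time instances $t_1$ and $t_2$ is defined as the set of sixteen TCs:

$$TC^{x_1 x_2}(t_1, t_2) = \big\{\, TC^{x_1 x_2}_{kl,mn}(t_1, t_2) \;:\; k, l, m, n \in S \,\big\}. \qquad (9)$$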



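Lemma 2. The spatiotemporal TCs of two input signals $x_1$ and $x_2$ at
$t_1, t_2 \in \{0^-, 0, 0^+\}$ are expressed in terms of their signal transition probability
sets (i.e. eq. 5) and the associated TC set (i.e. eq. 9) at time points $t_1 = 0$ and
$t_2 = 0$ as follows:

$$TC^{x_1 x_2}(t_1, t_2) = F\big(TC^{x_1 x_2}(0, 0),\; P^{x_1}(0),\; P^{x_2}(0)\big) \qquad (10)$$

and can be calculated by:

$$TC^{x_1 x_2}_{kk,mn}(0^-, 0) = \frac{TC^{x_1 x_2}_{kk,mn}(0,0)\, p^{x_1}_{kk}(0)\, p^{x_2}_{mn}(0) + TC^{x_1 x_2}_{k(1-k),mn}(0,0)\, p^{x_1}_{k(1-k)}(0)\, p^{x_2}_{mn}(0)}{p^{x_1}_{kk}(0)\, p^{x_2}_{mn}(0) + p^{x_1}_{k(1-k)}(0)\, p^{x_2}_{mn}(0)} \qquad (11)$$

$$TC^{x_1 x_2}_{ll,mn}(0^+, 0) = \frac{TC^{x_1 x_2}_{ll,mn}(0,0)\, p^{x_1}_{ll}(0)\, p^{x_2}_{mn}(0) + TC^{x_1 x_2}_{(1-l)l,mn}(0,0)\, p^{x_1}_{(1-l)l}(0)\, p^{x_2}_{mn}(0)}{p^{x_1}_{ll}(0)\, p^{x_2}_{mn}(0) + p^{x_1}_{(1-l)l}(0)\, p^{x_2}_{mn}(0)} \qquad (11.1)$$

$$TC^{x_1 x_2}_{kk,mm}(0^-, 0^-) = \frac{N^-}{D^-}, \qquad (11.2)$$

$$\begin{aligned}
N^- ={}& TC^{x_1 x_2}_{kk,mm}(0,0)\, p^{x_1}_{kk}(0)\, p^{x_2}_{mm}(0) + TC^{x_1 x_2}_{kk,m(1-m)}(0,0)\, p^{x_1}_{kk}(0)\, p^{x_2}_{m(1-m)}(0) \\
 &+ TC^{x_1 x_2}_{k(1-k),mm}(0,0)\, p^{x_1}_{k(1-k)}(0)\, p^{x_2}_{mm}(0) + TC^{x_1 x_2}_{k(1-k),m(1-m)}(0,0)\, p^{x_1}_{k(1-k)}(0)\, p^{x_2}_{m(1-m)}(0), \\
D^- ={}& p^{x_1}_{kk}(0)\, p^{x_2}_{mm}(0) + p^{x_1}_{kk}(0)\, p^{x_2}_{m(1-m)}(0) + p^{x_1}_{k(1-k)}(0)\, p^{x_2}_{mm}(0) + p^{x_1}_{k(1-k)}(0)\, p^{x_2}_{m(1-m)}(0)
\end{aligned}$$

$$TC^{x_1 x_2}_{ll,nn}(0^+, 0^+) = \frac{N^+}{D^+}, \qquad (11.3)$$

$$\begin{aligned}
N^+ ={}& TC^{x_1 x_2}_{ll,nn}(0,0)\, p^{x_1}_{ll}(0)\, p^{x_2}_{nn}(0) + TC^{x_1 x_2}_{ll,(1-n)n}(0,0)\, p^{x_1}_{ll}(0)\, p^{x_2}_{(1-n)n}(0) \\
 &+ TC^{x_1 x_2}_{(1-l)l,nn}(0,0)\, p^{x_1}_{(1-l)l}(0)\, p^{x_2}_{nn}(0) + TC^{x_1 x_2}_{(1-l)l,(1-n)n}(0,0)\, p^{x_1}_{(1-l)l}(0)\, p^{x_2}_{(1-n)n}(0), \\
D^+ ={}& p^{x_1}_{ll}(0)\, p^{x_2}_{nn}(0) + p^{x_1}_{ll}(0)\, p^{x_2}_{(1-n)n}(0) + p^{x_1}_{(1-l)l}(0)\, p^{x_2}_{nn}(0) + p^{x_1}_{(1-l)l}(0)\, p^{x_2}_{(1-n)n}(0)
\end{aligned}$$

$$TC^{x_1 x_2}_{kl,mn}(0^-, 0) = TC^{x_1 x_2}_{kl,mn}(0^+, 0) = 0 \quad \forall k, l, m, n \in S \wedge k \neq l, \qquad (11.4)$$

$$TC^{x_1 x_2}_{kl,mn}(0^-, 0^-) = TC^{x_1 x_2}_{kl,mn}(0^+, 0^+) = 0 \quad \forall k, l, m, n \in S \wedge k \neq l \wedge m \neq n \qquad (11.5)$$

where $k, l, m, n \in S$. Proof: Due to lack of space the proof is omitted here.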
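
Since the proof is omitted, a numerical sanity check can be useful. The Python sketch
below is an illustration under stated assumptions: it reads eq. (11) as a
marginalization over the unobserved next state of $x_1$ (with the common factor
$p^{x_2}_{mn}(0)$ cancelled), builds an arbitrary joint distribution over the
transitions of two inputs at t=0, and compares the formula against the value obtained
directly from the definition of the TC. All function names are hypothetical.

```python
import random

S = (0, 1)


def marginals(joint):
    """Transition probabilities of x1 and x2 at t=0 from joint[(k, l, m, n)]."""
    p1 = {(k, l): sum(joint[(k, l, m, n)] for m in S for n in S) for k in S for l in S}
    p2 = {(m, n): sum(joint[(k, l, m, n)] for k in S for l in S) for m in S for n in S}
    return p1, p2


def tc_zero_delay(joint):
    """Zero-delay TCs, eq. (6), for every (k, l, m, n)."""
    p1, p2 = marginals(joint)
    return {key: joint[key] / (p1[key[:2]] * p2[key[2:]]) for key in joint}


def tc_minus_zero(joint, k, m, n):
    """TC_{kk,mn}(0^-, 0) computed from zero-delay quantities, as in eq. (11)."""
    p1, _ = marginals(joint)
    tc = tc_zero_delay(joint)
    numerator = sum(tc[(k, l, m, n)] * p1[(k, l)] for l in S)
    denominator = sum(p1[(k, l)] for l in S)
    return numerator / denominator


def tc_minus_zero_direct(joint, k, m, n):
    """Same quantity from the definition: x1 holds k on (-T, 0) while x2 goes m -> n."""
    p1, p2 = marginals(joint)
    p_both = sum(joint[(k, l, m, n)] for l in S)
    p_x1_holds_k = sum(p1[(k, l)] for l in S)
    return p_both / (p_x1_holds_k * p2[(m, n)])


def random_joint(seed=0):
    """Arbitrary strictly positive joint distribution (test data only)."""
    rng = random.Random(seed)
    raw = {(k, l, m, n): rng.uniform(0.1, 1.0) for k in S for l in S for m in S for n in S}
    total = sum(raw.values())
    return {key: value / total for key, value in raw.items()}


if __name__ == "__main__":
    joint = random_joint()
    for k in S:
        for m in S:
            for n in S:
                a = tc_minus_zero(joint, k, m, n)
                b = tc_minus_zero_direct(joint, k, m, n)
                print(k, m, n, round(a, 6), round(b, 6))
```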



4 Timed Boolean Functions 

As mentioned, glitch generation is strongly dependent on time. Therefore, a
modified Boolean function that describes both the logic and the timing behavior is
needed. This modified Boolean function is called a Timed Boolean Function (TBF); the
mathematical foundation of TBFs was presented in [9]. Exploiting the timing
properties of the input signals, a range interval (0,T) can be partitioned into the
coarsest set of sub-intervals, within each of which a TBF is an ordinary Boolean
function. It has been proved in [9] that any binary signal with known switching times
can be represented by a TBF using the unit step function. The inputs are therefore
modeled by TBFs and can be represented by a set of time intervals and the
corresponding Boolean functions in each interval. Consequently, a TBF of an
internal node can be regarded as a transformation of a set of intervals and Boolean
functions of the inputs to another set of intervals and Boolean functions, considering
the gate delays and their Boolean operations.

A gate operation is logically and temporally separable if the computation can be
performed in two separate steps: i) delay the inputs and ii) perform the Boolean
operation on the delayed inputs. The TBF of the circuit output is a composition of the
TBFs of its gates. Thus, the TBF of an output f is given by:

$$f(t) = F(t, x_1, \ldots, x_n) = F\big(x_1(t-d_1), \ldots, x_n(t-d_n)\big), \qquad (12)$$

where $d_i$ is the delay of the path starting from the input $x_i$ and terminating at f.

Definition 5. The Boolean variables of an input signal, $x_i(t)$, which is modeled as a
TBF, are defined by:

$$x_i(t) = x_i^{-} \quad \text{if } t \in (-T, 0), \qquad (13)$$

$$x_i(t) = x_i^{+} \quad \text{if } t \in (0, T). \qquad (13.1)$$



Example: The corresponding TBFs of the nodes $u_1$ and f of the circuit shown in
Figure 1 are:

$$u_1(t) = x_1(t-1)\, x_2(t-1), \qquad (14)$$

$$f(t) = u_1(t-1)\, x_2(t-1) = x_1(t-2)\, x_2(t-2)\, x_2(t-1). \qquad (14.1)$$

Considering the TBF of node f, we infer that there exist two valid switching time
points, namely t=1 and t=2, and therefore three time intervals,
i.e. $(-\infty, 1)$, $(1, 2)$ and $(2, +\infty)$. All the TBF variables are positive for
$t \in (2, +\infty)$ (i.e. $x_1^+$, $x_2^+$). Thus, within this time interval the signal f
does not perform a transition and the TBF is reduced to $f = x_1^+ x_2^+$. Similarly, the
associated Boolean functions in the transitionless time intervals $(-\infty, 1)$ and
$(1, 2)$ are $f = x_1^- x_2^-$ and $f = x_1^- x_2^- x_2^+$, respectively.

In order to manipulate the TBFs efficiently, TBF-OBDDs have been presented in
[9]. More specifically, a TBF-OBDD consists of an upper BDD, called the T-OBDD, and
a set of ordinary OBDDs. The purpose of the T-OBDD is to represent the associated
time intervals of the TBF. Any leaf node of the T-OBDD is a dummy node, which is
replaced by the OBDD of the Boolean function of the corresponding interval. Thus,
the OBDD of the ordinary Boolean function corresponding to the time interval
$(t_i, t_{i+1})$ is the OBDD that replaces the leaf node of the right branch of the node
$t_i$. Also, the OBDD of the Boolean function of the time interval $(t_{i-1}, t_i)$
corresponds to the OBDD that replaces the rightmost leaf node of the left branch of the
node $t_i$ (which is the same as the OBDD of the right branch of node $t_{i-1}$). It has
been proved in [9] that TBF-OBDDs are canonical and can be reduced and manipulated as
ordinary OBDDs. The corresponding TBF-OBDD of node f of the running example is shown
in Figure 2.



5 Switching Activity Evaluation 

Generally, a signal f of the logic circuit is a Boolean function of a subset of the
primary input signals, i.e. $f = F(x_1(t), x_2(t), \ldots, x_v(t))$, where $v \leq n$.

Fig. 2. The TBF-OBDD of circuit node f of Figure 1



Definition 4. We define as Valid Time Points Set, $T^f = \{t_1^f, t_2^f, \ldots, t_r^f\}$, the
transition time points of a signal f.

Having determined the valid time points, the switching activity estimation problem is
reduced to the estimation of $P^f(t)$ $\forall t \in T^f$.

A circuit node f switches at time $t = t_i^f$ if the derivative with respect to time t of
its TBF is equal to 1. Thus, the average switching activity, $E^f(t_i^f)$, is:

$$E^f(t_i^f) = p\Big(\lim_{\varepsilon \to 0}\big\{f(t_i^f - \varepsilon) \oplus f(t_i^f + \varepsilon)\big\} = 1\Big). \qquad (15)$$

Instead of performing the XORing between the Boolean functions corresponding to
the time intervals $(t_{i-1}^f, t_i^f)$ and $(t_i^f, t_{i+1}^f)$, we can manipulate the TBF-OBDD
efficiently in order to evaluate the switching probability of node f at time point
$t = t_i^f$. Taking into consideration the representation of the ordinary Boolean
function on the TBF-OBDD, the evaluation of the transition $k \to l$ with $k, l \in S$ at
time instance $t_i^f$ can be done by the following steps (similar to the zero-delay
procedure of [5]):

i) Find the set of paths, $\Pi_k(t = t_i^f)$, of the OBDD of the Boolean function
corresponding to the time interval $(t_{i-1}^f, t_i^f)$, which terminate at node k,

ii) Find the set of paths, $\Pi_l(t = t_i^f)$, of the OBDD of the Boolean function
corresponding to the time interval $(t_i^f, t_{i+1}^f)$, which terminate at node l,

iii) Combine each path of $\Pi_k(t = t_i^f)$ with all paths of $\Pi_l(t = t_i^f)$ and extract
the switching behavior, taking into account the temporal compatibility of each
primary input signal, using the following equation:



$$p^{f}_{kl}(t_i^f) = \sum TC \,\cdots \qquad (16)$$



As mentioned, each input signal x has at most two Boolean variables,
namely the variables $x(0^-) = x^-$ and $x(0^+) = x^+$. The values of the variables $x^-$ and
$x^+$ denote the values of the signal x in the time intervals (-T,0) and (0,T), respectively.

If the variable $x^-$ or $x^+$ appears in both paths $\pi$ and $\pi'$, then its value
must be the same in both paths. This property solves the problem of temporal
compatibility reported in [7].
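
To make the evaluation tangible for the running example, the following Python sketch
computes the switching probability of f at t=1 and t=2. It is a deliberate
simplification and not the OBDD-based procedure described above: it enumerates all
assignments of the variables $(x_1^-, x_1^+, x_2^-, x_2^+)$ instead of combining OBDD
paths, and it assumes that $x_1$ and $x_2$ are mutually independent (all TCs equal to 1).
Because each variable takes a single value per assignment, temporal compatibility holds
by construction. The names used are hypothetical.

```python
from itertools import product

# Boolean functions of f in the three transitionless intervals of the example.
INTERVAL_FUNCS = [
    lambda x1m, x1p, x2m, x2p: x1m & x2m,        # t in (-inf, 1)
    lambda x1m, x1p, x2m, x2p: x1m & x2m & x2p,  # t in (1, 2)
    lambda x1m, x1p, x2m, x2p: x1p & x2p,        # t in (2, +inf)
]


def switching_probabilities(p1, p2):
    """Return [E_f(t=1), E_f(t=2)] given p1[(k, l)], p2[(k, l)] = p_kl(0) of x1, x2."""
    probs = [0.0] * (len(INTERVAL_FUNCS) - 1)
    for x1m, x1p, x2m, x2p in product((0, 1), repeat=4):
        # Independence assumption: the joint weight is a plain product.
        weight = p1[(x1m, x1p)] * p2[(x2m, x2p)]
        values = [g(x1m, x1p, x2m, x2p) for g in INTERVAL_FUNCS]
        for i in range(len(probs)):
            if values[i] != values[i + 1]:  # f differs across the i-th valid time point
                probs[i] += weight
    return probs


if __name__ == "__main__":
    uniform = {(k, l): 0.25 for k in (0, 1) for l in (0, 1)}
    print(switching_probabilities(uniform, uniform))  # [0.125, 0.25]
```

Replacing the plain product by a weight that includes the appropriate TCs would be the
natural next step towards the correlated case treated in the paper.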



6 Experimental Results 



The proposed power estimation method is implemented in ANSI C, and its
efficiency is demonstrated on a number of ISCAS'85 benchmark multilevel
circuits. For the technology mapping step, a general library of primitive gates (i.e.
AND, OR, etc.) with up to 4 inputs is used. For a signal x, we define as Real
Node Error the quantity
$Err(x) = |E^{\mathrm{real}}_{\mathrm{eff}}(x) - E^{\mathrm{est}}_{\mathrm{eff}}(x)| / E^{\mathrm{real}}_{\mathrm{eff}}(x)$,
where $E^{\mathrm{real}}_{\mathrm{eff}}(x)$ is the real effective switched capacitance of signal x
and $E^{\mathrm{est}}_{\mathrm{eff}}(x)$ is the estimated effective capacitance. The effective
switched capacitance is calculated as the product of the switching activity E(x) and
the total capacitance load of the node x, $C = F_o C_{in}$, where $F_o$ is the fanout of
this node and $C_{in} = 0.05$ pF is a typical input gate capacitance.
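
The per-node quantities above amount to a few lines of arithmetic. The sketch below is
illustrative only; the function names and the sample activities are assumptions, while
the 0.05 pF input capacitance is the value quoted in the text.

```python
C_IN_PF = 0.05  # typical input gate capacitance quoted in the text, in pF


def effective_capacitance(switching_activity, fanout, c_in=C_IN_PF):
    """Effective switched capacitance: E(x) * C, with C = fanout * c_in (in pF)."""
    return switching_activity * fanout * c_in


def node_error(real, estimated):
    """Real Node Error: |real - estimated| / real."""
    return abs(real - estimated) / real


if __name__ == "__main__":
    # Hypothetical node with fanout 3: simulated activity 0.40, estimated activity 0.38.
    real = effective_capacitance(0.40, fanout=3)
    est = effective_capacitance(0.38, fanout=3)
    print(f"{100 * node_error(real, est):.1f} %")  # 5.0 %
```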

For a combinational circuit with N signals and a specific input vector set $V_j$, we
define as Total Power Consumption the quantity

$$Power(V_j) = \tfrac{1}{2}\, V_{dd}^2\, f \sum_{i=1}^{N} E_{\mathrm{eff}}(x_i),$$

as Real Total Error the quantity

$$Total\ Error(V_j) = \big|Power^{\mathrm{real}}(V_j) - Power^{\mathrm{est}}(V_j)\big| \,/\, Power^{\mathrm{real}}(V_j),$$

as Real Mean Error the quantity

$$Mean\ Error(V_j) = \frac{1}{N} \sum_{i=1}^{N} Err(x_i),$$

and finally as Real Maximum Error the quantity

$$Max\ Error(V_j) = \max\{Err(x_1), Err(x_2), \ldots, Err(x_N)\}.$$

If we choose M input vector sets (for a reliable comparison) the above formulas become:

$$Power^{M} = \sum_{j=1}^{M} Power(V_j), \qquad Total\ Error^{M} = \sum_{j=1}^{M} \big|Power^{\mathrm{real}}(V_j) - Power^{\mathrm{est}}(V_j)\big| \,\Big/\, \sum_{j=1}^{M} Power^{\mathrm{real}}(V_j),$$

$$Mean\ Error^{M} = \frac{1}{M} \sum_{j=1}^{M} Mean\ Error(V_j) \quad \text{and} \quad Max\ Error^{M} = \max\{Mean\ Error(V_1), \ldots, Mean\ Error(V_M)\}.$$

For comparisons, three categories of input vectors are chosen: i) without spatial
correlation (column NO), ii) with low spatial correlation (column LOW) and iii) with
high spatial correlation (column HIGH). For each category and circuit, M=10 input
vector sets of 50000 vectors are generated.
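
The error metrics defined above can be sketched for a single input vector set as
follows. This is an illustrative Python example with made-up numbers: the inputs are
lists of per-node effective switched capacitances (real and estimated), and the
constant factor $\tfrac{1}{2} V_{dd}^2 f$ of the power formula is omitted because it
cancels in the Total Error ratio. The function names are hypothetical.

```python
def total_error(real, est):
    """Relative error of the summed effective capacitance (proportional to total power)."""
    return abs(sum(real) - sum(est)) / sum(real)


def node_errors(real, est):
    """Per-node relative errors Err(x_i)."""
    return [abs(r - e) / r for r, e in zip(real, est)]


def mean_error(real, est):
    errors = node_errors(real, est)
    return sum(errors) / len(errors)


def max_error(real, est):
    return max(node_errors(real, est))


def mean_error_over_sets(per_set_mean_errors):
    """Average of the per-set Mean Errors over M input vector sets."""
    return sum(per_set_mean_errors) / len(per_set_mean_errors)


if __name__ == "__main__":
    real = [0.06, 0.10, 0.04]   # hypothetical per-node values in pF
    est = [0.057, 0.11, 0.04]
    print(total_error(real, est), mean_error(real, est), max_error(real, est))
```

Note that the Total Error can be much smaller than the Mean Error, because over- and
under-estimates at different nodes partially cancel in the sum; the same pattern is
visible in the TOTAL and MEAN columns of the tables below.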

We compare the proposed method and the method of [7] with Mentor Graphics'
QUICKSIM II gate-level simulator. The power consumption differences between each
method and the simulator are reported in Tables 1 and 2. In particular, the
columns TOTAL give the $Total\ Error^{M}$ of the total power dissipation over all of
the 10 estimations, the columns MEAN give the $Mean\ Error^{M}$, and the columns
MAX give the $Max\ Error^{M}$ for the 10 power estimations.

Table 1 shows the errors in power estimation (%) of the proposed method and
demonstrates its quality. The average TOTAL error is about 0.07 % for NO
spatial input correlation, 1.62 % for LOW spatial correlation, and 1.67 % for HIGH
spatial correlation. The corresponding average MEAN error values are 0.87 %, 4.91
%, and 5.74 %, while the average MAX errors are 1.77 %, 7.65 %, and 8.85 %. Table
2 shows the errors of the method of [7]. It can be seen that, for the NO spatial
correlation column, the error values of [7] are comparable with the corresponding
errors of the proposed method. In contrast, the average errors of the remaining two
categories (i.e. LOW and HIGH spatial correlation) are considerably larger, that is,
7.60 % and 9.27 % for TOTAL power, 15.74 % and 18.83 % for MEAN power, and 24.96 % and
30.84 % for MAX power, respectively. We conclude that neglecting the spatial
correlations among the primary inputs increases the power estimation error (e.g. for
HIGH correlation, the MAX error of circuit cm82 is nearly 55 %).



Table 1. Power estimation errors (%) of the proposed method

Circuit            TOTAL                    MEAN                     MAX
               NO     LOW    HIGH      NO     LOW    HIGH      NO     LOW    HIGH
9symml      0.012   0.406   0.007   0.549   2.537   3.068   0.940   3.025   3.582
C17         0.006   1.174   1.124   0.271   2.140   2.527   0.971   3.411   4.609
Cm163       0.019   1.968   2.260   0.870   6.376   7.409   1.592   9.670  11.227
Cm42        0.022   0.816   0.946   1.037   6.411   7.393   2.270   9.940  11.514
Cm82        0.012   3.422   3.357   0.265   4.886   5.581   0.480   8.592   9.354
Cm85        0.007   2.092   2.353   0.789   4.185   4.836   1.860   7.370   8.468
Cmb         0.205   0.131   0.350   2.265   5.193   6.394   4.512   8.491   9.696
Cu          0.336   0.080   0.000   1.134   2.268   2.634   1.625   2.827   3.349
Decod       0.104   1.886   2.177   1.577   9.341  10.885   1.973  15.865  18.346
F51m        0.058   1.814   1.616   0.597   3.446   3.811   1.157   4.766   5.105
Majority    0.017   0.339   0.029   0.482   3.230   4.390   1.503   4.820   6.424
Pm1         0.067   2.091   2.385   1.721  10.093  11.563   4.568  15.586  17.877
Rca4        0.108   6.072   6.852   0.325   6.767   7.576   0.583  11.476  12.940
x2          0.010   1.034   0.999   0.797   5.337   6.383   1.308   6.652   8.129
Z4ml        0.038   0.941   0.515   0.398   1.506   1.641   1.185   2.185   2.124
Average     0.068   1.618   1.665   0.872   4.914   5.739   1.768   7.645   8.850



Table 2. Power estimation errors (%) of the method of [7]

Circuit            TOTAL                    MEAN                     MAX
               NO     LOW    HIGH      NO     LOW    HIGH      NO     LOW    HIGH
9symml      3.205   0.616   1.685   2.657   7.818   3.068   4.007  10.611  15.707
C17         0.188   7.846   9.469   0.835  13.534  16.270   1.437  30.645  39.011
Cm163       0.097   5.824   7.081   1.354  15.045  18.590   2.219  21.615  27.256
Cm42        0.006   1.986   2.405   1.455  16.029  19.398   3.148  24.035  29.262
Cm82        0.170  20.576  24.716   0.757  25.917  31.009   1.156  45.345  54.872
Cm85        0.079   8.814  10.615   1.521  12.256  15.213   5.324  22.205  26.182
Cmb         0.105   2.555   3.045   1.291   7.917   9.399   2.137  11.758  14.153
Cu          0.467   2.260   2.880   1.732   6.602   7.790   2.312   8.978  10.622
Decod       0.038   7.595   9.114   2.119  22.958  28.293   3.547  36.668  48.091
F51m        0.131   8.764  10.698   1.216  13.119  15.714   1.862  22.982  27.048
Majority    0.032   4.978   5.940   1.005  12.503  14.780   2.979  22.060  25.969
Pm1         0.041   5.631   7.028   2.208  24.772  32.438   5.181  34.826  44.208
Rca4        0.196  22.786  27.680   1.293  28.368  34.852   2.189  43.239  52.237
x2          0.041   5.815   6.988   1.196  19.366  23.846   1.939  24.148  29.069
Z4ml        0.169   7.972   9.744   1.156   9.911  11.834   1.943  15.287  18.831
Average     0.331   7.601   9.273   1.453  15.741  18.833   2.759  24.960  30.835



7 Conclusions 

The proposed method constitutes an extension of the zero-delay probabilistic method
presented in [5] and takes into account the first-order temporal correlations and
the spatial correlations, not only for the structural dependencies of the logic
circuit but also for the data dependencies at the primary input signals. Since the
proposed method is a global approach, our future work is to implement a method that
propagates the primary input statistics and correlation coefficients through the logic
network, efficiently estimating the switching activity at any node and any valid time
point.



References 



1. J. Rabaey and M. Pedram, “Low Power Design Methodologies,” Kluwer Academic
Publishers, 1996.

2. F. Najm, “A Survey of Power Estimation Techniques in VLSI Circuits (Invited paper),” in
IEEE Trans. on VLSI, Vol. 2, No. 4, pp. 446-455, December 1994.

3. F. Najm, “Transition Density: A New Measure of Activity in Digital Circuits,” in IEEE
Trans. on CAD, Vol. 12, No. 2, pp. 310-323, February 1993.

4. P. Schneider and U. Schlichmann, “Decomposition of Boolean Functions for Low Power
Based on a New Power Estimation Technique,” in Proc. of Int. Workshop on Low Power
Design, pp. 123-128, Napa Valley, CA, April 1994.

5. R. Marculescu, D. Marculescu, and M. Pedram, “Efficient Power Estimation for Highly
Correlated Input Streams,” in Proc. of DAC, pp. 628-634, 1995.

6. J. Monteiro, A. Ghosh, S. Devadas, K. Keutzer, and J. White, “Estimation of Average
Switching Activity in Combinatorial and Sequential Circuits,” in IEEE Trans. on CAD,
Vol. 16, No. 1, pp. 121-127, January 1997.

7. J.C. Costa, J.C. Monteiro, and S. Devadas, “Switching Activity Estimation Using Limited
Depth Reconvergent Path Analysis,” in Proc. of ISLPD, pp. 184-189, 1997.

8. K. Parker and E. McCluskey, “Probabilistic Treatment of General Combinational
Networks,” in IEEE Trans. on Electronic Computers, C-24(6), pp. 668-670, 1975.

9. W. Lam and R.K. Brayton, “Timed Boolean Functions: A Unified Formalism for Exact
Timing Analysis,” Kluwer Academic Publishers, 1994.




A Holistic Approach 
to System Level Energy Optimization 



Mary Jane Irwin, Mahmut Kandemir, N. Vijaykrishnan, 
and Anand Sivasubramaniam 

Department of Computer Science and Engineering 
The Pennsylvania State University 
University Park, PA 16802-6106 
http://www.cse.psu.edu/~mdl



Abstract. Over the past few years, the design automation community 
has expended a lot of effort in developing low power design methodolo- 
gies. However, with the increasing software content in mobile environ- 
ments and the proliferation of such devices in our day to day life, it 
is essential to take a fresh holistic look at power optimization from an 
integrated hardware and software perspective. This paper envisions the 
tools and methodologies that will become necessary for performing such 
optimizations. It also presents insights into the interaction and influence 
of hardware and software optimizations on system energy. 



1 Introduction 

Energy has become an important design consideration, together with performance,
in computer systems. While energy-conscious design is obviously crucial
for battery-driven mobile and embedded systems, it has also become important
for desktops and servers due to packaging and cooling requirements, as power
consumption has grown from a few watts per chip to over 100 watts. As a result,
there has been a great deal of interest recently (e.g., [8,9,10,11,12,13,14,15,16])
in examining optimizations for energy reduction from both the hardware and
software points of view.

From the hardware viewpoint, there are several complementary energy sav- 
ing trends and techniques. These include the use of higher levels of integration 
thereby clustering components into smaller and/or less energy consuming pack- 
ages [17], the continuous scaling of supply voltages, and the use of hardware- 
controlled clock-gating [7] that automatically shuts down portions of the chip 
when not in use. Another important trend is the support of different operat- 
ing modes, each consuming a different amount of energy at the cost of a loss 
in performance. Some on-chip energy reduction operating modes are based on 
scaling the supply voltage and/or clock frequency [18,19] under low load con- 
ditions. Others are based on the transitioning of unused hardware components 
into energy-conserving modes under the direction of software control. 

However, the power-aware computing community has long claimed that the
greatest energy savings (other than those from supply voltage scaling) are to be
obtained at the software and application levels, as illustrated in Figure 1. From the
software viewpoint, new compiler, runtime, and application-directed techniques
are being developed that target improvements in the energy-performance product
or that selectively utilize as few hardware components as possible (without
paying performance penalties), thereby allowing the remainder to be transitioned
into energy-conserving modes [14]. Unlike hardware optimizations, where the
designer is usually faced with trading performance for reductions in energy, it is an
open question as to whether the best performance-oriented compiler optimizations
are the best from the energy point of view.

Another important consideration when tackling the energy problem is knowing
the energy budget of one's system. Ensuring that the major energy-consuming
portions of the system are the ones being optimized will, of course, give
the largest overall improvements. In fact, Amdahl's Law for performance can be
modified to apply to energy. 'The performance benefits to be gained using some
faster mode of execution is limited by the fraction of the time the faster mode can
be used' becomes 'the energy benefits to be gained by applying an energy saving
optimization is limited by the fraction of the time that optimized component is
used'. As an example, if one is focused on achieving energy savings in the ALU
and the ALU accounts for only 2% of the total energy budget, then the overall
return will be very small indeed. Thus, it is important to know the energy budget
of the system for the intended application environment. For example, Figure 2
shows the energy budget of the on-chip datapath and caches and the off-chip
DRAM for two benchmark codes drawn from different application environments:
a static compilation environment that targets array-dominated C codes (on the
left) and a Java-based dynamic (runtime, on-demand) compilation environment
(on the right). As we see on the left, the overall energy of array-based codes is
dominated by the instruction cache due to the high frequency and good locality of
instruction accesses. We observe that state-of-the-art compiler optimizations
aggressively reduce the number of datapath operations, thereby causing the
memory-bound instructions to be a significant portion of the energy budget. As
compared to the array-based application, the Java code is much more
computation-intensive, and datapath energy constitutes nearly 40% of the overall
energy budget. This observation is interesting because dynamic compilation and
other features of Java in general exercise the memory much more than
statically-compiled C codes. However, memory-bound characteristics of the
array-based domain dominate and mask the expected behavior. Thus, it is important
for designers to understand the runtime environment as well as the application
characteristics before focusing their efforts on optimizing specific components.
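
The energy form of Amdahl's Law can be made concrete with a two-line calculation. The
sketch below is an illustration, not a formula taken from the paper: it assumes that an
optimization removes a given fraction of one component's energy and leaves all other
components unchanged. The function name and the 50% saving are hypothetical; the 2% and
40% budget shares are the figures quoted in the text.

```python
def overall_energy_saving(component_fraction, component_saving):
    """Fraction of total system energy saved when one component is optimized.

    component_fraction: share of the total energy budget spent in the component (0..1)
    component_saving:   fraction of that component's energy removed (0..1)
    """
    return component_fraction * component_saving


if __name__ == "__main__":
    # ALU at 2% of the budget: even halving its energy saves only 1% overall.
    print(overall_energy_saving(0.02, 0.5))
    # Datapath at about 40% of the budget (the Java case): the same 50% cut saves 20%.
    print(overall_energy_saving(0.40, 0.5))
```

Even eliminating the ALU's energy entirely would cap the overall saving at 2% in the
first case, which is the point of the example in the text.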

Although larger energy gains can be obtained by optimizations made at the
software and application levels, hardware optimizations are still crucial. For
example, we have found in our experiments that while certain widely used
high-level compiler optimizations (e.g., loop permutation, loop unrolling, and loop
tiling [6]) might optimize overall system energy, they can often increase on-chip
energy, impacting chip packaging and cooling. This is due to a reduction of
off-chip DRAM usage at the cost of an increased on-chip usage. This impact
can be mitigated by optimizing the cache to reduce its energy (e.g., with block
buffering [20] and cache subbanking [21]). Also, to achieve the greatest energy
savings, the designer must consider the interactions of hardware and software
energy optimizations in the intended application environment. Obviously,
compiler optimizations impact the energy gains of cache block buffering. In fact,
optimizations made at the software level can be negated or improved by energy
optimizations made at the hardware level, and vice versa.

Fig. 1. Comparison of Energy Optimizations at Different Levels.

When focusing on a set of energy optimizations it is important to be aware of 
changes in technology in order to ensure that the optimizations will be of benefit 
in the future. Supply voltage scaling, new interconnect materials, embedding ad- 
ditional components on-chip (e.g., eDRAM, RF), globally asynchronous-locally 
synchronous control styles all could have a significant impact on the relative 
benefits of a set of energy optimizations. We have looked at, in particular, how 
energy trade-offs in optimizations made in the memory system will be affected 
by the move from off-chip DRAM to eDRAM. While we concentrate here on 
reducing dynamic energy (the energy consumed when transistors change state), 
in the future power problems will be exacerbated by a dramatic increase in leak- 
age power (currently less than 2% of the power equation) due to the scaling of 
supply voltages and, thus, threshold voltages. 

The remainder of this paper details a variety of experiments we have done 
using energy modeling/simulation tools developed at Penn State in an attempt 
to address some of the issues raised above. 




[Figure 2 legend: Datapath, Icache, Dcache, Imemory, Dmemory]



Fig. 2. Energy Breakdown Between Different Components. Left: Static Compilation of
an Array-Dominated C Code. Right: Dynamic Compilation of a Java Code.



2 Tools

Tools for accurate power-performance prediction are essential for designing 
power-aware architectures, compilers, run-time support, communication proto- 
cols, and applications. Currently, there are tools to measure the power at either 
a very fine-grain (circuit or gate) level or coarse-grain (procedural or program) 
level. With fine-grain estimation, it is difficult or impossible to measure power 
usage in (future) billion transistor designs or for large programs. However, this 
is the most accurate approach to power estimation. On the other hand, coarse- 
grain measurements can only give gross estimates, but do so quite efficiently. 
Thus it is essential to provide a hierarchical spectrum of design tools for power 
estimation and optimization as shown in Figure 3. These tools can be used, as 
will be described shortly, to perform a series of ‘what if’ energy optimization 
experiments with various hardware and software design alternatives consider- 
ing the computing system as a whole, rather than as a sum of parts (processor 
core, memory hierarchy, system level interconnect, etc.). In this section, we ex- 
plain the design and use of PowerMon, a multi-granular power estimation and 
optimization tool currently being developed at Penn State. 

Power estimation can be performed at the application, procedure, and in- 
struction level granularity using PowerMon (see Figure 4). The monitoring ca- 
pability of the Operating System (OS) along with energy measurement devices is 
utilized in developing CoarseMon to provide the coarse-grain estimates at the ap- 
plication and procedure levels. The energy hot-spots identified using this coarse- 
grain tool can, later, be studied in detail using a cycle-accurate instruction-level 
simulator, FineMon. This hierarchical approach provides an efficient mechanism 
for trading the simulation time and the estimation accuracy. Further, simulators 
for different types of processors such as scalar, pipelined, superscalar and VLIW 
can be plugged into FineMon to evaluate the influence of architectural choices 
on system energy. FineMon also provides the flexibility of choosing among tra- 
ditional (direct-mapped, set-associative) and recently proposed power-efficient 
(way-prediction, sub-banking, isolated bit line, block buffering) cache archi- 
tectures [22,20,32,34,31]. The accuracy and time can further be traded within 
FineMon based on the energy-models used. 

Energy models for the datapath, control unit, system level interconnect, clock 
and memory components of the system can be either analytical or transition- 
sensitive. In transition-sensitive models, the switching activity of the input data 
is captured by the model. Analytical models, on the other hand, provide an 
energy consumption measure per access independent of the input data values. 
While transition-sensitive models are much more accurate they are also more 
time consuming to develop and incur longer running times. 
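
The difference between the two kinds of model can be illustrated with a minimal Python sketch. The two energy constants are hypothetical placeholders, not PowerMon values, and the transition-sensitive model is assumed here to charge energy in proportion to the number of input bits that toggle between consecutive accesses.

```python
# Hypothetical constants, chosen only for illustration (not PowerMon values).
ENERGY_PER_ACCESS_J = 1.0e-9       # analytical model: fixed cost per access
ENERGY_PER_BIT_TOGGLE_J = 5.0e-11  # transition-sensitive model: cost per toggled bit


def analytical_energy(inputs):
    """Energy that depends only on the number of accesses, not on the data."""
    return len(inputs) * ENERGY_PER_ACCESS_J


def transition_sensitive_energy(inputs, initial=0):
    """Energy proportional to the bits that switch between consecutive inputs."""
    energy = 0.0
    previous = initial
    for value in inputs:
        toggled_bits = bin(previous ^ value).count("1")  # Hamming distance
        energy += toggled_bits * ENERGY_PER_BIT_TOGGLE_J
        previous = value
    return energy


quiet_trace = [0x0F, 0x0F, 0x0F, 0x0F]  # same value repeated: little switching
busy_trace = [0x00, 0xFF, 0x00, 0xFF]   # every bit flips on each new value

print(analytical_energy(quiet_trace), analytical_energy(busy_trace))
print(transition_sensitive_energy(quiet_trace), transition_sensitive_energy(busy_trace))
```

The analytical model reports the same energy for both traces, while the transition-sensitive model separates them at the cost of inspecting every input value.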



Fig. 3. Unified Energy Estimation and Optimization Framework.

Fig. 4. PowerMon: Hierarchical Energy-Aware Design Framework.

2.1 CoarseMon

It is becoming increasingly apparent that the different hardware modules
(processing unit, memory, disk, etc.) operating in an energy-constrained
environment should support several (at least more than one) modes of operation,
each consuming a different amount of energy. When a module is not being exercised,
the software can then selectively transition it to a lower energy consumption 
state/mode. Recognizing the importance of such capabilities, there has been re- 
cent interest in standardizing these different operating modes for each module in 
the form of the Advanced Configuration and Power Interface (ACPI) from both 
hardware and software vendors [24]. With time, one can expect different mod- 
ules to support several modes (many of them already support different modes), 
and it is up to the software to effectively utilize these modes to lower the overall 
energy consumption of the system. However, response times are likely to suffer 
when there is a request to a module that has been transitioned to a lower energy 
consumption mode. Hence, the software has to employ intelligent heuristics to 
determine when to cause the transitions. Monitoring the system and applica- 
tion activity (based on the past and current behavior) to predict future usage 
can be valuable towards this goal as shown in several studies [36,35,29,25]. Fur- 
ther, the current energy usage/availability would be extremely useful for doing 
application-level adaptation (to the compiler generated ‘smart’ code or directly 
to the application); i.e., execute code that is more tailored for performance if 
there is adequate power, or vice versa. Finally, monitoring and estimation are
crucial to the design of the operating/runtime system itself and to the
development of the energy-delay aware services that are demanded from it.

Recognizing the importance of tools and techniques for energy monitor- 
ing/estimation, there have been prior studies looking at this issue [25,30]. Our 
coarse-grain monitoring tool, CoarseMon, attempts to reduce the overheads of 
the monitoring so that measurements can be taken more frequently and accu- 
rately. Instead of using an external device to measure the power, and interfacing 
with this device, we explore the use of energy counters that can be provided in 
the hardware. This is similar to the performance (and statistics) counters that 
many modules already support, except that they contain energy consumption 
information for the corresponding module since the last time they were read. 
Such counters can be built to monitor the signal switching activities that form 
the basis of FineMon’s transition-sensitive energy models. Periodically (using the 
traditional timer-based mechanism), the OS reads off these counters and writes 
them into logs. Periodically, these logs can be flushed to the disk without becom- 
ing significantly intrusive on actual execution. The logs would contain program 
counter information (available on the stack during an interrupt) and the current 
energy counters in the last interval. Energy counter values can then be used 
to drive analytical energy models to estimate the consumption. Post-processing 
these logs, one can associate the energy consumption with the procedural level 
using the program counters (in the log) and the symbol table information of 
the compiled program. With such a design, monitoring would be relatively
non-intrusive: it only reads counters within the timer interrupt mechanism (which
would be called even in a non-monitored system). A further benefit is that there
is no external device to interface with during the interrupts. As a result, the 
monitoring can be done more frequently for better accuracy.

CoarseMon can be used as a stand-alone platform to drive energy-delay
conscious application development and compilation, operating/runtime system,
communication protocol and architectural design. In addition, the energy
counters along with their associated run-time support software
interface can be used to perform dynamic adaptation. For instance, based on 
current conditions, module state transitions can be initiated. Further, the appli- 
cation (or the compiled code) can use them to find out current conditions for 
dynamically changing the code to be executed. 
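
The post-processing step described for CoarseMon, which associates logged energy with procedures through the program counter and the symbol table, might look like the following Python sketch. The log layout, the symbol table and all addresses are hypothetical; the sketch also makes the simplifying assumption that the whole energy of an interval is charged to the procedure that was executing when the sample was taken.

```python
import bisect
from collections import defaultdict


def attribute_energy(samples, symbol_table):
    """Sum per-interval energy by procedure.

    samples: iterable of (program_counter, energy_joules) log records.
    symbol_table: list of (start_address, procedure_name), sorted by address.
    """
    starts = [start for start, _ in symbol_table]
    totals = defaultdict(float)
    for pc, energy in samples:
        # Index of the last symbol that starts at or before the sampled pc.
        index = bisect.bisect_right(starts, pc) - 1
        name = symbol_table[index][1] if index >= 0 else "<unknown>"
        totals[name] += energy
    return dict(totals)


# Hypothetical symbol table and log, for illustration only.
symbols = [(0x1000, "main"), (0x1200, "matmul"), (0x1800, "print_result")]
log = [(0x1004, 0.002), (0x1250, 0.010), (0x12F0, 0.012), (0x1810, 0.001)]
profile = attribute_energy(log, symbols)

for name, joules in sorted(profile.items(), key=lambda item: -item[1]):
    print(f"{name}: {joules:.3f} J")
```

Sorting the result by energy gives the list of hot-spots that would then be handed to the fine-grain simulator.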



2.2 FineMon 

The power estimation tool, FineMon, is depicted in Figure 5. FineMon consists 
of a cycle accurate processor datapath simulator, a cache/bus simulator, energy 
models for the various components including clock and memory systems, and 
compiler/OS support tools. At each clock cycle, FineMon simulates the activi- 
ties of all the components and calls corresponding power estimation interfaces. It 
continues the simulation until the predefined program halt instruction is fetched. 
In order to support ‘what if’ architectural level experimentation, the datapath is
specified only to the RTL level so that many different architectural alternatives
can be quickly evaluated. In order to keep the simulator technology independent,
register transfer language (RTL) power estimation interfaces have been
developed for all the components. These interfaces utilize the technology dependent
energy models.




Fig. 5. FineMon Energy Estimation Framework.

A prototype of FineMon has already been developed for two processors. One
was based on the ISA of the Hitachi SH-DSP, a merged DSP/RISC processor 
with on-chip code and data RAM. This prototype was validated by comparing 
its results against measurements made by Hitachi using gate-level power simula- 
tion [27,28]. A more recent prototype of FineMon, SimplePower — a single-issue 
five-stage pipelined architecture based on the SimpleScalar instruction set archi- 
tecture (ISA) [26] with a cache-based memory hierarchy — has been developed 
and used to perform experiments in architectural and compiler optimizations to 
reduce energy consumption [14]. 

3 Hardware versus Software 

Hardware and software techniques to reduce energy consumption have become 
an essential part of current system designs. In this section, we seek answers for 
the following questions: 

— What is the impact of current performance-oriented software optimizations 
(that primarily aim at maximizing data locality and enhancing parallelism 
[6]) on energy? How do they affect the energy consumption of different system
components (memory system, datapath, etc.)? 

— What are the relative gains obtained using software and hardware optimiza- 
tion techniques? How can one exploit the interaction between these opti- 
mizations to reduce energy further? 

— Is the most efficient code from the performance perspective the same as that 
for the energy viewpoint? If not, why? 

— How does the impact of these optimizations get affected as a result of antic- 
ipated technological improvements in the future? 

Of course, answering all these questions completely in such a short article
is not possible. However, we believe that any progress made in answering them 
will pave the way for our understanding of impacts and interactions of hardware 
and software optimizations. 

3.1 Impact of Software Optimizations on System Energy 

In this section, we evaluate the impact of three widely used high-level compiler 
optimizations on a simple matrix multiply code. The optimizations considered 
are as follows: 

Linear Loop Transformation: The linear loop transformations attempt to im- 
prove cache performance, instruction scheduling, and iteration-level parallelism 
by modifying the traversal order of the iteration space of the loop nest. The 
simplest form of loop transformation, called loop interchange [6], can improve 
data locality (cache utilization) by changing the order of the loops. 

Loop Tiling: Another important technique used to improve cache perfor- 
mance is blocking, or tiling [37]. When it is used for cache locality, arrays that 
are too big to fit in the cache are broken up into smaller pieces (to fit in the
cache) and the nested loop in question is restructured accordingly. In the extreme
case, loop tiling can double the number of loops in the nest.

Loop Unrolling: This optimization unrolls a given loop, thereby reducing 
loop overhead and increasing the amount of computation per iteration. 
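
The three transformations can be sketched on a small matrix multiply as follows. This is an illustrative Python rendering of the loop structures only; the function names, the tile size and the unroll factor of 2 are assumptions, and Python lists do not reproduce the cache behavior of the compiled array code studied here.

```python
def matmul_original(a, b, n):
    """Reference i-j-k loop nest."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_interchanged(a, b, n):
    """Loop interchange: i-k-j order walks rows of b and c contiguously."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Loop tiling: three tile loops wrap the three element loops."""
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_unrolled(a, b, n):
    """Loop unrolling: innermost loop unrolled by 2, with a cleanup iteration."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0
            k = 0
            while k + 1 < n:
                total += a[i][k] * b[k][j] + a[i][k + 1] * b[k + 1][j]
                k += 2
            if k < n:  # leftover iteration when n is odd
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


n = 5
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
reference = matmul_original(a, b, n)
print(matmul_tiled(a, b, n, 2) == reference)
```

The tiled version has six loops instead of three, which is the extreme case of doubling mentioned above, and it is also where the extra branches and loop-bound calculations come from.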



Energy Consumptions. We evaluated the energy consumptions for the ma- 
trix multiply code for different cache topologies (configurations) and program 
versions (each corresponding to different combinations of three optimizations 
mentioned above). The first observation we made is that all optimizations ex- 
cept loop unrolling increase the core power. This is due to the fact that the 
optimized versions generally have more complex loop structures; that, in turn, 
means extra branches and more complex subscript and loop bound calculations. 
Loop unrolling is an exception, as it reduces loop control overhead and enables 
better loop scheduling. 

When considering the memory power, on the other hand, we made the follow- 
ing observations. First, with the increasing cache size and/or associativity, tiling 
performs better than pure linear loop transformations and unrolling. Unlike those 
optimizations, tiling exploits locality in all loop nest dimensions; increasing asso- 
ciativity helps to eliminate conflict misses between different array tiles. Second, 
in the original (unoptimized) code, the memory power is 5 to 47 times larger 
than the core power. However, after some optimizations, this picture changes. In 
particular, beyond a 2K, 2-way set associative cache (i.e., higher associativities 
or larger caches), the core and memory powers become comparable when some 
optimizations are applied. For example, when tiling is applied for a 2K, 4-way 
associative cache, the memory energy is 0.0764 J, which is smaller than the core 
energy, 0.0837 J. Similarly, for the most optimized version (that uses all three 
optimizations), the core and memory energy consumptions are very close for a 
4K, 4-way set associative cache. This shows that when we apply optimizations, 
we reduce the memory energy significantly making the contribution of the core 
energy more important. Since we expect these optimizations (in particular, loop 
tiling) to be applied frequently by optimizing compilers, reducing core power 
using additional techniques might become very important. Overall, the power 
optimizations should not focus only on memory, but need to consider the overall 
system power. In fact, the choice of best optimization for this example depends 
strongly on the underlying cache topology. For instance, when we consider the
total energy consumed in the system, the version that uses only loop permutation
and unrolling performs best for a 4K, 2-way cache, whereas for an 8K, 8-way
cache the most optimized version (that uses all three optimizations) outperforms
the rest. Thus, given a search domain for optimizations and a target cache
topology, an optimizing compiler can decide which optimizations will be most
suitable.

Cache Miss Rates versus Energy Consumptions. We now investigate the
correlation between cache miss rate and energy consumption. Figure 6 gives the
miss rates for some selected cases. This subsection will make some correlations
between miss rates and energy consumptions.

                                   Miss Rates
Version             Cache size   1-way    2-way    4-way    8-way
original            1K           0.1117   0.1020   0.1013   0.1013
                    2K           0.0918   0.0989   0.1013   0.1013
                    4K           0.0737   0.0330   0.0245   0.0150
                    8K           0.0680   0.0214   0.0117   0.0117
linear transformed  1K           0.0278   0.0119   0.0113   0.0104
                    2K           0.0185   0.0107   0.0099   0.0099
                    4K           0.0135   0.0100   0.0099   0.0099
                    8K           0.0118   0.0099   0.0099   0.0099
unrolled            1K           0.0678   0.0384   0.0359   0.0359
                    2K           0.0479   0.0362   0.0359   0.0359
                    4K           0.0358   0.0198   0.0145   0.0173
                    8K           0.0294   0.0135   0.0077   0.0077
tiled               1K           0.0180   0.0055   0.0039   0.0039
                    2K           0.0105   0.0028   0.0016   0.0016
                    4K           0.0046   0.0016   0.0012   0.0013
                    8K           0.0027   0.0008   0.0007   0.0006

Fig. 6. Miss Rates for the Matrix Multiply Code.

Let us first consider the miss rates
and energy consumption of the original (unoptimized) code. When we move from 
one cache configuration to another, we have a similar reduction rate for energy 
as that for miss rate. For instance, going from 1K, 1-way to 1K, 2-way reduces
the miss rate by a factor of 1.10 and reduces the energy by the same factor. As
another example, when we move from 1K, 1-way to 4K, 8-way, we reduce the miss
rate by a factor of 7.45, and the corresponding energy reduction is a factor of 7.20. 
These results show that the gain in energy obtained by increasing associativity is 
not offset, in general, by the increasing complexity of the cache topology. As long 
as a larger or higher-associative cache reduces miss rates significantly (for a given 
code), we might prefer it, as the negative impact of the additional complexity is 
not excessive. However, we note that when moving from one cache configuration 
to another, if there is not a significant change in miss rate (as was the case in 
our experiments when going from 1K, 4-way to 1K, 8-way), we incur an energy
increase. This can be expected as, everything else being equal, a more complex 
cache consumes more power (due to more complex matching logic). 

Next, we investigate the impact of various optimizations for a fixed cache 
(and memory) topology. The following three measures are used to capture the 
correlation between the miss rates and energy consumption of the original and 
optimized versions. 



$$\mathrm{Improvement}_m = \frac{\text{Miss rate of the original code}}{\text{Miss rate of the optimized code}},$$

$$\mathrm{Improvement}_e = \frac{\text{Memory energy consumption of the original code}}{\text{Memory energy consumption of the optimized code}},$$

$$\mathrm{Improvement}_t = \frac{\text{Total energy consumption of the original code}}{\text{Total energy consumption of the optimized code}}.$$

In the following discussion, we consider four different cache configurations: 
1K, 1-way; 2K, 4-way; 4K, 2-way; and 8K, 8-way. Given a cache configuration,
the following table shows how these three measures vary when we move from 
the original (unoptimized) version to an optimized (tiled) version of the matrix 
multiply code. 





                 1K, 1-way   2K, 4-way   4K, 2-way   8K, 8-way
Improvement_m    6.21        63.31       20.63       19.50
Improvement_e    2.13        18.77       5.75        2.88
Improvement_t    1.96        9.27        3.08        1.47



We see that in spite of very large reductions in miss rates as a result of 
tiling, the reduction in energy consumption is not as high. Nevertheless, it still 
follows the miss rate. We made the same observation in different benchmark 
codes as well. We have found that Improvement_e is smaller than Improvement_m
by a factor of 2 to 15. Including the core (datapath) power makes the situation
worse for tiling (from the energy point of view), as this optimization increases the 
core energy consumption. Therefore, compiler writers for energy-aware systems 
can expect an overall energy reduction as a result of tiling, but not as much 
as the reduction in the miss rate. Thus, optimizing compilers that estimate the 
miss rate (before and after tiling) statically at compile time can also be used 
to estimate an approximate value for the energy variation. The following table 
gives the same improvement measures for the loop unrolled version of the matrix 
multiply code. 





                 1K, 1-way   2K, 4-way   4K, 2-way   8K, 8-way
Improvement_m    1.65        2.82        1.67        1.52
Improvement_e    2.07        3.53        2.07        1.83
Improvement_t    2.03        3.37        1.97        1.68



The overall picture here is totally different. First, Improvement_e is larger than
Improvement_m, which shows that loop unrolling is a very useful transformation
from the energy point of view. Including the core power makes only a small 
difference, as this optimization reduces the core power as well. We should mention 
that our other experiments (not presented here due to lack of space) yielded 
similar results. We now look at the loop transformed version of the same code: 





                 1K, 1-way   2K, 4-way   4K, 2-way   8K, 8-way
Improvement_m    4.02        10.23       3.30        1.18
Improvement_e    3.42        8.51        2.74        0.99
Improvement_t    3.17        6.84        2.32        0.94

Here, Improvement_e closely follows Improvement_m. Including the core energy
brings the energy improvement down further, as in this example, the loop
optimization results in extra operations for the core. In the experiments with other
cache configurations, we observed similar trends: Improvement_e generally
follows Improvement_m, but it is slightly lower. In addition, Improvement_t is
smaller than Improvement_e by a factor of 1.05 to 1.80.

We can conclude that the energy variations do not necessarily follow miss rate 
variations in the optimized array-dominated codes. More correlations between 
energy behavior and performance metrics can be found in [5]. 
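
As a cross-check, the miss-rate improvement measure (the miss rate of the original code divided by that of the optimized code) can be recomputed from the miss rates reported in Fig. 6 for the four cache configurations discussed above. The short Python sketch below does this; the dictionary layout and function name are illustrative choices, while the numbers are copied from Fig. 6.

```python
# Miss rates from Fig. 6, keyed by (cache size, associativity).
CONFIGS = [("1K", 1), ("2K", 4), ("4K", 2), ("8K", 8)]
MISS_RATE = {
    "original": {
        ("1K", 1): 0.1117, ("2K", 4): 0.1013, ("4K", 2): 0.0330, ("8K", 8): 0.0117,
    },
    "linear transformed": {
        ("1K", 1): 0.0278, ("2K", 4): 0.0099, ("4K", 2): 0.0100, ("8K", 8): 0.0099,
    },
    "unrolled": {
        ("1K", 1): 0.0678, ("2K", 4): 0.0359, ("4K", 2): 0.0198, ("8K", 8): 0.0077,
    },
    "tiled": {
        ("1K", 1): 0.0180, ("2K", 4): 0.0016, ("4K", 2): 0.0016, ("8K", 8): 0.0006,
    },
}


def miss_rate_improvement(version, config):
    """Original miss rate divided by the miss rate of the optimized version."""
    return MISS_RATE["original"][config] / MISS_RATE[version][config]


for version in ("tiled", "unrolled", "linear transformed"):
    row = [f"{miss_rate_improvement(version, config):.2f}" for config in CONFIGS]
    print(version, row)
```

The computed ratios agree with the first row of each of the three improvement tables to within rounding, which confirms that those rows are derived directly from Fig. 6.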

3.2 Relative Impact of Hardware and Software Optimizations 
on Memory Energy 

In this section, we focus specifically on memory system energy due to data
accesses and illustrate how software and hardware optimizations affect this energy.

Hardware Optimizations. A host of hardware optimizations have been pro- 
posed to reduce the energy consumption. In this section, we focus on two cache 
optimizations, namely, block buffering [20] and cache subbanking [21]. Note that 
none of these optimizations cause a noticeable negative impact on performance. 
In the block buffering scheme, the previously accessed cache line is buffered for 
subsequent accesses [20]. If the data within the same cache line is accessed on 
the next data request, only the buffer needs to be accessed. This avoids the un- 
necessary and more energy consuming access to the entire cache data and tag 
array. Multiple block buffers can be thought of as a small sized Level 0 cache. In 
the cache subbanking optimization, which is also known as column multiplexing 
[21], the data array of the cache is divided into several subbanks and only the 
subbank where the desired data is located is accessed. This optimization reduces 
the per access energy consumption. 
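
The behavior of block buffering can be explored with a small trace-driven sketch such as the one below. It is an illustration only: the source does not specify how multiple block buffers are managed, so least-recently-used replacement is assumed here, and the address trace, line size and function name are hypothetical.

```python
from collections import OrderedDict


def block_buffer_hit_rate(addresses, line_size, num_buffers):
    """Fraction of accesses served by a small LRU set of buffered cache lines."""
    buffers = OrderedDict()  # keys are line numbers, ordered from LRU to MRU
    hits = 0
    for address in addresses:
        line = address // line_size
        if line in buffers:
            hits += 1
            buffers.move_to_end(line)        # mark as most recently used
        else:
            if len(buffers) >= num_buffers:
                buffers.popitem(last=False)  # evict the least recently used line
            buffers[line] = True
    return hits / len(addresses) if addresses else 0.0


# Hypothetical trace: two arrays read alternately, 4-byte elements, 16-byte lines.
trace = []
for i in range(64):
    trace.append(0x1000 + 4 * i)  # element i of array X
    trace.append(0x8000 + 4 * i)  # element i of array Y

for count in (1, 2, 4):
    print(count, block_buffer_hit_rate(trace, 16, count))
```

In this trace a single buffer is evicted on every access because the two arrays alternate, whereas two buffers hold one line of each array. Accesses that hit a buffer avoid the full tag and data array access, which is the source of the energy saving.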

We studied the energy consumed by the matrix multiply code in the data 
cache with different configurations of block buffers and subbanks (the number 
of block buffers being either 2, 4 or 8 and the number of sub-banks varying 
from 1 to 4) for a 4K cache with various associativities. This result showed that 
increasing the number of sub-banks from one to two provides an energy saving 
of 45% for the data cache accesses. An additional 22% saving is obtained by 
increasing the number of sub-banks to 4. Note that the savings are not linear
in the number of sub-banks, as one might expect them to be. This is because the
energy cost of the tag arrays remains constant, while there is a small increase
in energy due to additional sub-bank decoding. We found that, for block
buffering, adding a single block
buffer reduced the energy by up to 50%. This reduction is achieved by capturing 
the locality of the buffered cache line, thereby avoiding accesses to the entire 
data array. However, access patterns in many applications can be regular and 
repeating across a varied number of different cache blocks. In order to capture 
this effect, we varied the number of block buffers to two, four, and eight as well. 
We observed that, for our matrix multiply benchmark, an additional 17% (as
compared to a single buffer) energy saving can be achieved using four buffers.

We also found that using a combination of eight block buffers and four
sub-banks, the energy consumed in a 4K (16K) data cache could be reduced on an
average by 88% (89%). Thus, such hardware techniques can reduce the energy 
consumed by processors with on-chip caches. However, if we consider the entire 
memory system including the off-chip memory energy consumption, the energy 
savings from these techniques amount to only 4% (15%) when using a 4K (16K) 
data cache. Thus, it may be necessary to investigate optimizations at the software 
level to supplement these optimizations. 



Combined Optimizations for Memory Energy. It was found that when 
a combination of different software (loop tiling, loop unrolling, and linear loop 
transformations) and hardware (block buffering and subbanking) optimizations 
is applied, tiling performs the best among the three individual compiler op- 
timizations applied in terms of memory system energy across different cache 
configurations. Since, as mentioned earlier, tiling increases the cache energy
consumption, subbanking and block buffering are of particular importance here.
For the tiled code, moving from a base data cache configuration to one with 
eight block buffers and four subbanks reduces the overall memory system energy 
by around 10%. Thus, it is important to use a combination of hardware and 
software optimizations in designing an energy-efficient system. 

Further, we observed that the linear loop transformed codes exploited the 
block buffers better than the original code and other optimizations. For exam- 
ple, when using two (eight) block buffers in a 4K 2-way cache, the block buffer hit 
rate was 69% (82%) as compared to the 55% (72%) for the unoptimized matrix 
multiply code. Thus, it is also important to choose the software optimizations 
such that they provide the maximum benefits from the available hardware opti- 
mizations. 

Overall, we observe that even performance-based compiler optimizations
provide significantly higher energy savings than those gained using the
pure hardware optimizations considered. However, a closer observation reveals
that hardware optimizations become more critical for on-chip cache energy
reduction when executing optimized codes. We refer the reader to [3] for more
discussion on this topic. 



3.3 Impact of Software Optimizations on Instruction Cache 

Aggressive compiler optimizations that enhance locality of data accesses tend 
to increase the energy spent due to instruction accesses (as many of these op- 
timizations reduce the instruction reuse). For studying this impact on instruc- 
tion energy, we used four different motion estimation codes (Full Search, 3Step- 
Logarithmic Search, Hierarchical Motion Estimation, and Parallel Hierarchical 
One-Dimensional Search (PHODS)) [33]. These codes also show the importance
of choosing appropriate algorithms (i.e., application design) for energy savings. 
For instance, among the different algorithms employed to perform the motion
estimation, the most data-intensive full search code consumes about 8 times more
energy for data accesses than the most energy-efficient PHODS algorithm when
using an 8K direct-mapped data cache.

Fig. 7. Energy Reduction (%) due to High-Level Compiler Optimizations for Data
Accesses Using Different Cache Configurations (from Top to Bottom, Cache Size of
8KB, 16KB, 32KB, and 64KB).

Fig. 8. Energy (J) Consumption due to Instruction Accesses for Two-Way
Set-Associative Caches.

Further, we observed that, for the direct-mapped data caches, the energy 
expended during data accesses reduces when cache size is increased from 8KB to 
16KB. But this trend changes with further increases in cache size. This behavior
is due to the significant reduction in cache misses when cache size increases from
8KB to 16KB resulting in fewer energy-expensive memory accesses. However, for
cache sizes larger than 16KB, the increased per-access cache energy cost (due to
a larger capacitive load) starts to dominate any benefits from fewer cache misses.

Fig. 9. Energy Reduction (%) for Instruction Accesses Using Two-Way
Set-Associative Cache.
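
The trade-off between fewer misses and a higher per-access cache cost can be made concrete with a simple two-term energy model. The Python sketch below is illustrative only: every number in it (access count, miss rates, per-access energies) is hypothetical and was chosen to reproduce the qualitative trend described above, not taken from the experiments.

```python
# All numbers below are hypothetical and chosen only to illustrate the trend.
ACCESSES = 1_000_000
OFF_CHIP_ENERGY_J = 5.0e-8  # assumed energy per off-chip memory access
CACHE_CONFIGS = {
    # size in KB: (assumed miss rate, assumed energy per cache access in J)
    8: (0.050, 0.5e-9),
    16: (0.010, 0.7e-9),
    32: (0.008, 1.1e-9),
    64: (0.007, 1.8e-9),
}


def data_access_energy(miss_rate, cache_energy_per_access):
    """Every access pays the cache cost; each miss also pays the off-chip cost."""
    cache_part = ACCESSES * cache_energy_per_access
    memory_part = ACCESSES * miss_rate * OFF_CHIP_ENERGY_J
    return cache_part + memory_part


energy_by_size = {
    size: data_access_energy(*params) for size, params in CACHE_CONFIGS.items()
}
best_size = min(energy_by_size, key=energy_by_size.get)

for size, energy in energy_by_size.items():
    print(f"{size}KB: {energy:.2e} J")
```

With these assumed inputs the off-chip term dominates at 8KB, and the cache term dominates from 32KB onward, so the minimum falls at 16KB.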

It was also observed that beyond an instruction cache size of 8KB, most of the 
instruction accesses are captured in the cache. Thus, the number of instruction 
cache misses is small and most of the instruction related energy is consumed in 
accessing the instruction cache. Further, it is observed that the energy cost for 
instruction accesses is comparable to the energy consumed by data accesses for 
most configurations. This observation is important as most of the state-of-the-art 
compiler optimizations currently target only improving data accesses. 

Next, we tried to apply linear loop transformations, loop unrolling and loop 
tiling to the motion estimation codes. In optimizing the motion estimation codes, 
the compiler could not find any opportunities to apply tiling due to the
imperfectly nested nature of the loops in these codes. In two of the codes, however,
it successfully applied loop unrolling with unroll factors of 5 and 6. When we analyzed
the resulting optimized C codes, we observed that in all of them, there was an 
expansion in static code size as compared to the original. This is mainly due 
to loop unrolling and scalar replacement exercised by the compiler to improve 
cache and register performance. 

Figure 7 shows the change in energy consumption due to data accesses after 
applying the high-level optimizations. It is observed that the energy reduction 
is most significant for the full search algorithm that is most data-intensive. This 
reduction is due to the significant decrease in number of data accesses as a result 
of improved locality. For example, scalar replacement converts memory references 
to register accesses. However, this also leads to an increase in dynamic instruction 
count. We can also see from Figure 7 that, except for one case, high-level compiler 
optimizations improve the data energy consumption for all motion estimation 
codes in all configurations. The average data energy reduction over all studied 
cache sizes is 30.9% for direct-mapped caches, 39.4% for 2-way caches and 39.8% 
for 4-way caches. Our experiments also show that in hier and parallel_hier,
after the optimizations, there is an increase in the number of conflict misses
(as we do not use array padding). In particular, with parallel_hier, when
the cache size is very small and cache is direct-mapped, these conflict misses 
offset any benefits that would otherwise be obtained from improved data locality, 
thereby degrading the performance from the energy perspective. Increasing the 
associativity eliminates this conflict miss problem. 

It can be observed from Figure 9 that the energy consumed by instruction 
accesses increases on an average by 466%, 30% and 32% for the 3step_log, 
hier and parallel_hier optimized codes, respectively. The main reason for
this increase is the aggressive use of scalar replacement in these codes. While 
this optimization helps caches and registers to exploit temporal data locality, 
the use of scalar replacement in the inner loops of a given nest structure leads 
to significant increase in the dynamic instruction count. For example, in the 
optimized version of hier, dynamic instruction count increased to 62 million 
from 46 million. In contrast, the energy consumed by instruction accesses for 
full_search decreases by 13%. The data access pattern for full_search is 
more regular as compared to the other algorithms. Consequently, the MIPSpro 
optimizer was less aggressive with scalar replacement. Further, the application 
of loop unrolling on full_search reduced the number of branch instructions. 

The overall impact of the optimizations considering both the instruction and 
data accesses was also studied. It was observed that the optimizations decrease 
the energy consumption by 26% for full_search on the average. However, due 
to the detrimental impact on energy consumed by instruction accesses, the over- 
all energy consumption increased by approximately 153%, 11% and 43% for 
3step_log, hier and parallel_hier, respectively. 

3.4 Technological Trends 

We now investigate the relative magnitudes of the core power and the memory 
power for a specific optimization: loop tiling. Figure 10 shows the memory en- 
ergy for different values of Em (energy cost per access) for four different cache 
organizations. Note that Em = 4.95 x 10^-9 J is a reasonable value for today’s
technology and is based on the Cypress SRAM CY7C1326-133. The lowest value
that we experiment with in this section (4.95 x 10^-11 J) corresponds to the
magnitude of energy per first-level on-chip cache access with current technology. Em
can be reduced through better process technology, reduction in physical distance 
between memory and core (or using new memory implementation techniques). 
Considering the fact that large amounts of storage capacity are coming closer 
to the CPU, we expect to see lower Em values in the future. This can make the 
energy consumed in the core larger than the energy consumed in memory. Even 
for Em = 4.95 x 10^-9 J, in a 4K, 2-way cache, the two energy values (core and
memory) are the same. Given the fact that optimizations such as tiling are very 
popular and used by commercial compilers extensively, we predict that research 
(hardware and software) on reducing the core power will become even more im- 
portant. We refer the reader to [4] for a thorough discussion of the impact of 
compiler optimizations with varying energy cost per access values. 







                                  Memory Energy (J) for each value of Em (J)
Configuration  4.95 x 10^-11  2.475 x 10^-10  4.95 x 10^-10  2.475 x 10^-9  4.95 x 10^-9  2.475 x 10^-8  4.95 x 10^-8  2.475 x 10^-7
1K, 1-way      0.0164         0.0462          0.0836         0.3821         0.7553        3.7408         7.4727        37.3280
1K, 4-way      0.0090         0.0154          0.0234         0.0872         0.1671        0.8056         1.6038        7.9892
4K, 1-way      0.0194         0.0270          0.0364         0.1119         0.2062        0.9611         1.9047        9.4533
4K, 2-way      0.0183         0.0210          0.0243         0.0507         0.0837        0.3477         0.6778        3.3183



Fig. 10. Impact of Different Em Values on Total Memory System Energy Con- 
sumption for Tiled Matrix Multiply. 
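
The entries of Fig. 10 are consistent with a memory energy that is a fixed part plus a part proportional to Em. The Python sketch below fits that line through two entries of the 4K, 2-way row and compares its predictions with the remaining entries. The Em column values are read as 4.95 x 10^-11 up to 2.475 x 10^-7 J, which is an interpretation of the column headings; the affine model itself and the function name are assumptions made for illustration.

```python
# Em values (J per access) and memory energies (J) for the 4K, 2-way row of Fig. 10.
EM_VALUES = [4.95e-11, 2.475e-10, 4.95e-10, 2.475e-9,
             4.95e-9, 2.475e-8, 4.95e-8, 2.475e-7]
MEMORY_ENERGY = [0.0183, 0.0210, 0.0243, 0.0507,
                 0.0837, 0.3477, 0.6778, 3.3183]


def fit_affine(x0, y0, x1, y1):
    """Return (fixed, slope) of the line through two (Em, energy) points."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 - slope * x0, slope


# Fit through the first and fifth columns, then predict every column.
fixed, slope = fit_affine(EM_VALUES[0], MEMORY_ENERGY[0],
                          EM_VALUES[4], MEMORY_ENERGY[4])
predicted = [fixed + slope * em for em in EM_VALUES]

for em, actual, estimate in zip(EM_VALUES, MEMORY_ENERGY, predicted):
    print(f"Em={em:.3e} J  table={actual:.4f} J  model={estimate:.4f} J")
```

Under this reading, the fixed term is the part of the memory energy that does not change with Em, which is why lowering Em shrinks the memory energy toward a floor that is comparable to the core energy.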



4 Future Challenges 

Software makes up an ever larger share of energy-constrained systems. It is
therefore of utmost importance to develop closely intertwined monitoring and
optimization mechanisms involving the OS, compiler and communication software,
so as to provide an integrated approach to optimizing overall system power.
In particular, we see potential in the following areas:

— Fast and Accurate Energy Models: It remains extremely important to develop
accurate and fast energy models for different system components. Such
models can be utilized in power estimation and optimization tools (such as
cycle-accurate energy simulators and profilers), and can also be employed
in an optimizing compiler framework that specifically targets power. Since
an optimizing compiler may need to estimate energy for a given code many
times during compilation, such models should be efficient. In addition, such
models need to provide accurate information so that they can guide
high-level and low-level compiler optimizations.

— Energy-Aware Compilation Framework: It is important to design and implement
compilation frameworks for high-quality power-aware code generation.
Such a framework should take into account the power constraints known at
compile time as well as the power constraints that change dynamically at
run time. Among the important optimization problems are minimizing
memory requirements, improving data locality and optimizing data
decomposition in multiple memory spaces during static (compile time) power-aware
compilation, and minimizing bus switching activity. It is also important to
consider dynamic situations where the compiler does not know the possible
ranges of power constraints at compile time. In such cases, the compiler
can obtain dynamic power constraint information from the operating system
and can dynamically change the run-time activity to reduce power
consumption.

— Power-Aware Operating Systems: The operating system can play a major role
in power reduction by providing feedback to the compiler, architecture and
communication subsystems regarding dynamic system conditions. It can be
used for both coarse-level and fine-level power monitoring and management.
We anticipate that scheduling, synchronization and memory management
techniques will play a major role in minimizing overall system energy. Already,
there are pointers in the literature [30] that illustrate the promising
potential of such techniques.

— Power-Conscious Communication System: With increasing mobility of
power-aware systems, the need for addressing energy optimizations for
wireless communication is becoming critical. It is also predicted that the RF
components associated with communication will dominate the energy budget
of future mobile devices. A coordinated effort between different layers of
the OS and the communication protocol layers seems to be essential.

— Unified Optimizations for Energy: So far, the majority of efforts have focused
specifically on either hardware or software. However, the improvements in both
areas indicate the limitations of these techniques and suggest a unified approach
that involves both hardware and software. We envision a system in which
the software is aware of the low-power features of hardware components, and
dynamically adapts itself or the hardware to optimize energy. Similarly, the
hardware can provide a feedback mechanism to the software that enables
the latter to initiate dynamic energy optimizations.

5 Conclusions 

The goal of this study is to investigate the interaction and influence of
hardware and software optimizations on system energy. Towards this goal, we
evaluate three widely used high-level compiler optimizations from an energy
perspective, considering a variety of cache configurations including conventional
direct-mapped and associative caches as well as new energy-efficient subbanking and
block buffering designs.

Our results show that, as far as reducing the overall system energy is
concerned, software optimizations are more effective. However, they have an
important negative effect: they increase the energy consumption of the datapath
(core) and instruction cache. Consequently, hardware-based energy optimizations can
be used to mitigate that effect. This preliminary study identifies the development
of hierarchical, fast, and accurate energy models as an important area of future
research.



Abstract. Reduction of chip packaging and cooling costs for deep sub-micron
System-on-Chip (SOC) designs is an emerging issue. We present a simulation-
based methodology able to realistically model the complex environment in
which a SOC design operates in order to provide early and accurate power
consumption estimation. We show that a rich functional test bench provided by
a designer with a deep knowledge of a complex system is very often not
appropriate for power analysis and can lead to power estimation errors of several
orders of magnitude. To address this issue, we propose an automatic input
sequence generation approach based on a heuristic algorithm able to upgrade a
set of test vectors provided by the designer. The obtained sequence closely
reflects the worst-case power consumption of the chip and makes it possible to
observe how the chip will behave over time.



1 Introduction 

In recent years, new technologies have made it possible to integrate entire systems
on a single chip, giving rise to a new class of electronic devices called
Systems-on-Chip (SOCs). SOC products pose a real challenge not only for
manufacturing but also for design.

To cope with SOC design requirements, researchers developed co-design 
environments, whose main characteristic is to allow the designer to quickly evaluate 
the costs and benefits of different architectures, including both hardware and software 
components. To perform design space exploration, efficient and accurate analysis 
tools are required. In particular, power consumption is a major design issue and thus it 
mandates the availability of effective power estimation tools. Moreover, it is known 
that power analysis and optimization during the early design phases, starting from the 
system level, can lead to large power savings [1], [2], [3]. As a consequence, several
efforts have been devoted to developing methodologies for system-level power
estimation.

Early work on low-power design techniques has mostly focused on estimating
and optimizing power consumption in the individual SOC components (application-
specific hardware, embedded software, memory hierarchy, buses, etc.) separately.
Various power estimation and minimization techniques for hardware at the transistor,
logic, architecture, and algorithm levels have been developed in recent years, and
are summarized in [1], [2], [3], [4], [5].

Recently, researchers have started investigating system-level trade-offs and
optimizations whose effects transcend the individual component boundaries.
Techniques for synthesis of multiprocessor system architectures and heterogeneous 
distributed HW/SW architectures for real-time specifications were presented in [6], 
[7]. In [8] and [9], separate execution of an instruction set simulator (ISS) based 
software power estimator, and a gate-level hardware power estimator were used to 
drive exploration of tradeoffs in an embedded processor with memory hierarchy, and 
to study HW/SW partitioning tradeoffs. The traces for the ISS and hardware power 
estimator were obtained from timing-independent system-level behavioral simulation. 
To cope with the heterogeneous components SOCs usually embed, in [10] a tool is 
proposed based on the concurrent and synchronized execution of multiple power 
estimators that analyze different parts of the SOC, driven by a system-level simulation 
master. 

To assist designers in defining suitable input streams for power estimation
purposes, we developed an algorithm that improves an initial input sequence (either
provided by designers or randomly generated) so that it activates all the functions of
the system while trying to maximize the power it consumes. Because the resulting
sequence is able to activate the whole system exhaustively, more accurate power
figures can be obtained.

To generate suitable input sequences for SOCs where hardware and software tasks 
are mixed together, we need a system representation that abstracts architectural 
details. We achieve this goal by developing our algorithm in a co-design environment. 
The benefits that stem from this solution are twofold: 

1. by abstracting the behavior from the architecture, we deal with a high-level system
description that we can simulate with a very low cost in terms of CPU time; 

2. the sequences we compute are reusable, i.e., the same sequence can be used during 
power estimation at every level of abstraction. This allows us to use the developed 
test sequences, for example, for evaluating the power consumption of both the 
algorithm and the architecture implementing it. 

A prototype of the proposed input sequence generation technique has been 
implemented using the POLIS [11] co-design tool. It is based on a heuristic algorithm 
that automatically generates an input sequence able to exercise as much as possible of 
the specification, by interacting with a simulator executing a specification of the 
system under analysis. 

The remainder of the paper is organized as follows. In Section 2 we motivate our 
work. In Section 3 we describe the system representation we exploited. Section 4 
describes the optimization we developed, while Section 5 reports some preliminary 
experimental results assessing the effectiveness of the proposed approach. Finally, 
Section 6 draws some conclusions. 



2 Motivations 

In a SOC design, the entire system is implemented in a single chip module. As a
result, the power budget requirements are dictated by the most power-consuming
modules on the chip, and these determine the cooling requirements of the entire
chip. This is radically different from a multi-chip implementation, in which each
chip component can a priori have a different power budget and different cooling
requirements. To cope with this scenario, it is necessary to provide early and
accurate power estimation capabilities in the system-level design methodology.

It has already been demonstrated [10] that power consumption in hardware and 
software cannot be addressed separately due to the effects of system resources that 
they share (such as the cache and the buses). The most promising solution is the use 
of high-level hardware/software co-simulation in order to address power analysis in 
hardware and software concurrently. 

The approach that is presented in this paper is based on enriching a co-simulation 
based power estimation methodology with a tool able to generate vectors in order to 
improve the coverage of the functional specification. 

It is important to underline that the input sequence generator deals with the entire 
design and not with single modules separately. The reason behind this choice is that if 
the test generator can deal with the entire design, the resulting tests will only include 
vectors that are possible in normal functional modes. On the other hand, if test 
generation is performed on single modules, the generated vectors may contain illegal 
sequences that cannot be applied to the module from the primary inputs of the design. 

Moreover, since the system-level power estimation techniques proposed so far are 
essentially simulation-based ones, they require the system to be simulated under 
typical input sequences provided by designers. The definition of a proper set of input 
stimuli is a very time consuming and difficult task, since all the details of the design 
must be understood for generating suitable input sequences. The right trade-off 
between designer’s time and power estimation accuracy is often difficult to find, and 
this often results in power figures that underestimate the actual power consumption. 
Moreover, when generating typical input sequences, designers may be biased by
their knowledge of the desired system or module behavior, and so often fail to
identify input sequences that actually activate potential critical points in the
description.



3 System Representation 

In POLIS the system is represented as a network of interacting Co-design Finite State 
Machines (CFSMs). CFSMs extend Finite State Machines with arithmetic 
computations without side effects on each transition edge. The communication edges 
between CFSMs are events, which may or may not carry values. A CFSM can 
execute a transition only when an input event has occurred. 

A CFSM network operates in a Globally Asynchronous Locally Synchronous
fashion, where each CFSM has its own clock, modeling the fact that different
resources (e.g., HW or SW) can operate at widely different speeds. CFSMs
communicate via non-blocking depth-one buffers. Initially there is no relation
between local clocks and physical time; this relation is defined later by a process
called architectural mapping. This involves allocating individual CFSMs to
computation resources and assigning a scheduling policy to shared resources. CFSMs
implemented in hardware have local clocks that coincide with the hardware clocking.
CFSMs implemented in software have local clocks with a variable period that depends
both on the execution delay of the code implementing each transition and on the
chosen scheduling policy (e.g., allowing task preemption).
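
The communication scheme described above can be illustrated with a minimal sketch.
It is not POLIS code: the class names DepthOneBuffer and CounterCFSM are
hypothetical, and the sketch assumes that a non-blocking write to a full depth-one
buffer overwrites the unread event.

```python
class DepthOneBuffer:
    """One-place event buffer; emit() never blocks (assumption: it overwrites)."""

    def __init__(self):
        self.present = False
        self.value = None

    def emit(self, value=None):
        # Non-blocking write: any unread event is overwritten.
        self.present = True
        self.value = value

    def consume(self):
        # Read and clear the buffer; returns (was_present, value).
        result = (self.present, self.value)
        self.present, self.value = False, None
        return result


class CounterCFSM:
    """Toy CFSM (hypothetical): accumulates the values carried by its input events."""

    def __init__(self, inbox):
        self.inbox = inbox
        self.total = 0

    def step(self):
        # A transition can execute only when an input event has occurred.
        present, value = self.inbox.consume()
        if not present:
            return False
        self.total += value
        return True


inbox = DepthOneBuffer()
cfsm = CounterCFSM(inbox)
print(cfsm.step())               # no event pending, so no transition fires
inbox.emit(1)
inbox.emit(2)                    # overwrites the unread event carrying 1
print(cfsm.step(), cfsm.total)
```

The sender never waits for the receiver, which is what lets each CFSM run on its
own local clock.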

For each CFSM in the system description, a Control Flow Graph representation 
(Fig. 1) is computed, called S-Graph. 



[Figure: S-Graph fragment; recoverable labels: BEGIN, check point]

Fig. 1. An example of S-Graph

An S-Graph has a straightforward and efficient implementation as sequential code
on a processor. In the C code that POLIS generates from the S-Graph
representation, each statement is in almost 1-to-1 correspondence with a node in the
S-Graph. Thus the model can be simulated very efficiently as native object code.

Our goal is to compute input sequences that exercise as much as possible of the 
system specification, while trying to maximize the power consumption; we thus adopt 
statement coverage as a metric of the activity in the specification. Moreover, as far as 
power estimation is concerned, we characterize each statement with a value 
representing the cost in terms of energy required to execute the statement. 

In order to compute the adopted metric and to gather information during the
simulation of an input sequence, we instrument the simulation model by inserting:

1. trace points associated with each statement in the S-Graph. A trace point is a
fragment of code that records the number of times the statement has been
executed;

2. check points associated with each test in the S-Graph. A check point is a fragment
of code that records the number of times the test has been executed.

During simulations, we use trace points to evaluate statement coverage, while we
use check points to direct the search towards repeated traversal of test nodes that
still have uncovered outgoing branches.
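
The instrumentation idea can be sketched as follows. This is an illustrative
model, not the code POLIS generates: the class Instrumentation, the function
toy_transition and the statement/test numbering are all hypothetical.

```python
from collections import Counter


class Instrumentation:
    """Execution counters for trace points (statements) and check points (tests)."""

    def __init__(self, n_statements):
        self.n_statements = n_statements
        self.trace = Counter()   # statement id -> number of executions
        self.check = Counter()   # test id -> number of executions

    def trace_point(self, stmt_id):
        self.trace[stmt_id] += 1

    def check_point(self, test_id):
        self.check[test_id] += 1

    def statement_coverage(self):
        # Fraction of statements executed at least once.
        return len(self.trace) / self.n_statements


def toy_transition(x, instr):
    """Hypothetical instrumented code with three statements and one test."""
    instr.trace_point(0)         # statement 0: entry
    instr.check_point(0)         # test 0: is x positive?
    if x > 0:
        instr.trace_point(1)     # statement 1: "then" branch
    else:
        instr.trace_point(2)     # statement 2: "else" branch


instr = Instrumentation(n_statements=3)
toy_transition(5, instr)
print(instr.statement_coverage(), instr.check[0])
```

A test that has been executed often while one of its branches is still uncovered
is exactly the situation the check-point counters are meant to expose.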

In our simulation model, we assume that a new vector is placed on the system 
inputs if and only if the response to the previous input has already been computed and 
the system is in a steady state. 



4 Adopted Algorithm 



The architecture of the system we developed is shown in Fig. 2. 



[Figure: block diagram; recoverable label: System Simulator]

Fig. 2. System architecture



A Sequence Generator computes some input vectors and sends them to the System 
Simulator. The model specification is then simulated and the value of the adopted 
metric is computed. Finally, this value is sent back to the Sequence Generator. By 
exploiting this feedback architecture, the Sequence Generator can grade input vectors 
according to the associated metric figures. 

sequence hill_climber ( sequence initial_sequence )
{
    sequence current_solution = initial_sequence;
    int iter, weight, new_weight;

    /* Score the starting sequence (designer-provided or random). */
    weight = evaluate ( current_solution );
    for ( iter = 0; iter < MAX_ITER; iter++ )
    {
        /* Apply one random mutation, then re-score the sequence. */
        modify ( current_solution );
        new_weight = evaluate ( current_solution );
        if ( new_weight > weight )
            weight = new_weight;                       /* keep the improvement */
        else
            revert_modification ( current_solution );  /* undo the mutation */
    }
    return ( current_solution );
}



Fig. 3. The optimization algorithm

Given a metric to measure the goodness of a sequence of input vectors, we adopted
a heuristic algorithm to implement the Sequence Generator, whose goal is to find a
sequence that maximizes the value of the adopted metric. In order to reuse the
information (e.g., an input sequence) already provided by designers, we adopted a
random mutation hill climber, whose pseudo-code is reported in Fig. 3. The algorithm
randomly modifies the initial sequence provided by designers and evaluates it. A
modified sequence is accepted if and only if it improves the adopted metric.

We apply a sequence of vectors to the system inputs. Each vector is a set of events
that are concurrently applied to the system inputs at a given time. We coded the
sequence as a matrix of bits, where SEQUENCE_LENGTH is the number of rows in
the matrix and thus represents the number of vectors to be applied to the system
inputs. Conversely, N_INPUTS is the number of system inputs and thus the number
of columns in the matrix. The number of bits used to represent an input event e is
selected as follows:

1. 1 bit if e is an input event without value;

2. log2 n bits, where n is the number of different values associated with the event e,
if e is an input event with value.
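
The encoding rule can be written down directly. The helper name bits_for_event is
hypothetical, and the sketch assumes that log2 n is rounded up to a whole number of
bits (with at least one bit) when n is not a power of two.

```python
def bits_for_event(n_values=None):
    """Bits used to encode one input event.

    n_values is None for an event without value; otherwise it is the number
    of different values the event can carry.
    """
    if n_values is None:
        return 1                                # presence/absence only
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return max(1, (n_values - 1).bit_length())


for n in (None, 2, 4, 5, 256):
    print(n, bits_for_event(n))
```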

Given an initial solution, we randomly modify it and accept the new sequence if
and only if it improves the metric we adopted.
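
A runnable sketch of a random mutation hill climber over a bit matrix is given
below. It makes two assumptions that are not taken from the paper: the mutation
flips a single randomly chosen bit, and the stand-in scoring function count_ones
replaces the simulator-based metric purely for illustration.

```python
import random


def hill_climber(initial_sequence, evaluate, max_iter, rng):
    """Random mutation hill climber on a bit matrix (list of rows of 0/1)."""
    current = [row[:] for row in initial_sequence]   # leave the caller's copy intact
    weight = evaluate(current)
    for _ in range(max_iter):
        # Assumed mutation: flip one randomly chosen bit.
        r = rng.randrange(len(current))
        c = rng.randrange(len(current[0]))
        current[r][c] ^= 1
        new_weight = evaluate(current)
        if new_weight > weight:
            weight = new_weight                      # accept only strict improvements
        else:
            current[r][c] ^= 1                       # revert the modification
    return current, weight


def count_ones(sequence):
    """Stand-in metric; the real one would come from simulating the system."""
    return sum(sum(row) for row in sequence)


start = [[0] * 3 for _ in range(4)]                  # 4 vectors, 3 input bits each
best, score = hill_climber(start, count_ones, max_iter=200, rng=random.Random(0))
print(score)
```

Because a mutation is kept only when the score strictly increases, the score of
the returned sequence can never be lower than that of the initial sequence.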

The metric we use in this paper is defined as follows: 

$$
f(S) = K_1 \cdot \left( C_1 \sum_{i=0}^{N} OP_i + C_2 \sum_{j=0}^{M} \sum_{i=0}^{N_j} (1 - OP_i)\, NT_j \right) + K_2 \cdot Power(S) \qquad (1)
$$



Where: 

1. S is the input sequence to be evaluated;

2. N is the number of trace points;

3. M is the number of check points;

4. N_j is the number of trace points associated with check point j;

5. OP_i is equal to 1 if trace point i has been traversed, 0 otherwise;

6. NT_j is the number of times the check point associated with test j has been
executed during the simulation of the input sequence S;

7. Power(S) is the power consumption of the sequence S;

8. C_1, C_2, K_1 and K_2 are constants.

The metric is intended to maximize the number of statements covered by the input 
sequence, while maximizing the power consumption. This is motivated by observing 
that a fair measure of the system power consumption can be attained only if the input 
sequence is capable of effectively activating the entire system. An input sequence that 
fails in activating a subset of the system can lead to power figures that do not reflect 
actual power consumption. 

The first part of (1) measures how many statements the sequence S traverses and
tends to favor sequences that execute those tests whose outgoing branches have not
yet been covered. In order to preserve the already covered statements while trying to
cover new ones, the first sum must dominate the second (in the experiments we used
C_1 = 1,000 and C_2 = 10). Conversely, the second part of (1) is intended to take into
account the power consumed by the application of S to the system. In particular, this
term tends to favor those vectors that increase the power consumption.
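
The evaluation of the metric can be sketched as below. The function name metric
and its argument layout are hypothetical; C_1 = 1,000 and C_2 = 10 are the values
quoted in the text, whereas the defaults for K_1 and K_2 are placeholders chosen
only to make the example concrete, since the values used in the experiments are
not reported.

```python
def metric(op, nt, check_trace_points, power, c1=1000, c2=10, k1=1000.0, k2=1.0):
    """Evaluate f(S) from coverage data and a power figure.

    op[i]                 -- 1 if trace point i was traversed, else 0
    nt[j]                 -- number of executions of check point j
    check_trace_points[j] -- indices of the trace points associated with check point j
    power                 -- Power(S) as reported by the power estimator
    k1, k2                -- placeholder weights (assumed values)
    """
    covered = sum(op)
    # Rewards repeated execution of tests that still have uncovered trace points.
    pending = sum((1 - op[i]) * nt[j]
                  for j, trace_points in enumerate(check_trace_points)
                  for i in trace_points)
    return k1 * (c1 * covered + c2 * pending) + k2 * power


# Three trace points, one test executed three times, trace point 2 still uncovered.
print(metric(op=[1, 1, 0], nt=[3], check_trace_points=[[1, 2]], power=5.0))
```

With these weights, covering one more statement always outweighs any change in
the check-point term of comparable size, which mirrors the dominance requirement
stated for C_1 and C_2.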

Given the previously reported considerations, the number of covered statements
prevails over the power consumption and therefore K_1 >> K_2.



5 Experimental Results



We implemented a prototype of the proposed algorithm, called Hill Climber Test
Bench Generator (HC-TBG), in the C language.

Using this prototype, we performed a set of experiments, whose purpose was to 
assess the effectiveness of the proposed approach; the preliminary experiments have 
been run on a set of small benchmarks. All the results have been gathered on a Sun 
UltraSparc 5/360 running at 360 MHz and equipped with 256 Mbytes of RAM. 

We considered three control-dominated benchmarks: a belt control system, a traffic 
light controller and a dashboard system, whose characteristics are reported in Table 1, 
in terms of number of CFSMs and number of statements for each CFSM. 

Table 2 reports the results gathered with our algorithm. We have compared the 
metric figures our algorithm attains with the ones attained by random sequences and, 
when available, with functional vectors provided by designers. 

In Table 2, the column Benchmark reports the system under analysis, Vec reports
the number of vectors in the input sequence, and CPU reports the time spent
running HC-TBG. The remaining columns report the statement coverage (S) and the
energy consumption (E) attained by the HC-TBG, Random and Functional
sequences, respectively.



Table 1. Benchmark characteristics

Benchmark                  CFSM               Statements [#]
Belt Controller            BELT_CONTROLLER    31
                           TIMER              25
Traffic Light Controller   CONTROLLER         66
                           TIMER              13
Dashboard                  BELT               25
                           DISPLAY            73
                           FRC                26
                           FUEL               35
                           ODOMETER           18
                           TIMER              75
                           SPEEDOMETER3       17
                           SPEEDOMETER4       17


As shown in Table 2, HC-TBG sequences are far more effective than randomly
generated ones, and better than the functional ones.

To better investigate the effectiveness of the proposed approach, we carried out a
second set of experiments on the Dashboard benchmark. In particular, we selected
two partitionings and synthesized the corresponding hardware/software systems. We
then simulated the systems with the HC-TBG and functional sequences already
adopted in the previous experiment. Table 3 summarizes the results, reporting the
energy breakdown. In all cases HC-TBG is able to provide results at
least comparable with those obtained with functional vectors. For 3 CFSMs out of 8
(implemented either in software or in hardware), HC-TBG attains an energy
estimate that is 2 or 3 orders of magnitude higher than that attained by functional
vectors, which underlines the importance of input sequences able to activate
most of the system functionality. For the remaining CFSMs the two approaches attain
comparable results; in these cases we can thus conclude that the functional vectors
provided by designers are already able to produce good power estimates.



Table 2. Statement coverage and energy results

                                              HC-TBG               Random               Functional
Benchmark                  Vec [#]  CPU [s]   E [#]        S [%]   E [#]        S [%]   E [#]        S [%]
Belt Controller            1,000        408    2,945,741   89.3     2,433,689   52.7    n.a.         n.a.
Traffic Light Controller   1,000        441    3,602,133   94.9     1,819,476   83.5    n.a.         n.a.
Dashboard                  1,000     12,696   20,311,073   80.4    16,384,701   72.7    14,061,184   71.7


Table 3. The synthesized version of Dashboard

                 Partitioning 1                            Partitioning 2
CFSM             Impl.   HC-TBG [µJ]   Functional [µJ]     Impl.   HC-TBG [µJ]   Functional [µJ]
BELT             SW           1131.9               2.1     HW      5.1 x 10^-?   9.1 x 10^-?
DISPLAY          SW              6.1               8.7     SW      6.2           8.7
FRC              SW           3736.3            3736.3     SW      3736.3        3736.3
FUEL             SW           2004.4            1930.6     SW      2008.9        1930.6
ODOMETER         SW           2033.3              41.9     HW      1.7 x 10^-?   9.8 x 10^-?
TIMER            SW           1951.6            1951.6     SW      1951.6        1951.6
SPEEDOMETER3     SW           1671.2              34.7     HW      3.1 x 10^-?   1.1 x 10^-?
SPEEDOMETER4     SW           1667.8             168.4     HW      3.1 x 10^-?   3.7 x 10^-?
TOTAL                       12,198.2           7,874.3             7,703.0       7,627.2

(?: exponent digit illegible in the source.)


The energy figures for Partitioning 1 are also compared in Fig. 4. 
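
The per-CFSM gap between the two kinds of sequences can be recomputed from the
Partitioning 1 columns of Table 3. The sketch below only performs that
arithmetic; the names PARTITIONING_1 and energy_ratio are illustrative, and the
energy values are taken as printed in the table.

```python
import math

# (HC-TBG, Functional) energy for the all-software Partitioning 1, from Table 3.
PARTITIONING_1 = {
    "BELT": (1131.9, 2.1),
    "DISPLAY": (6.1, 8.7),
    "FRC": (3736.3, 3736.3),
    "FUEL": (2004.4, 1930.6),
    "ODOMETER": (2033.3, 41.9),
    "TIMER": (1951.6, 1951.6),
    "SPEEDOMETER3": (1671.2, 34.7),
    "SPEEDOMETER4": (1667.8, 168.4),
}


def energy_ratio(hc_tbg, functional):
    """How many times larger the HC-TBG estimate is than the functional one."""
    return hc_tbg / functional


for name, (hc_tbg, functional) in PARTITIONING_1.items():
    ratio = energy_ratio(hc_tbg, functional)
    # log10 of the ratio expresses the gap in orders of magnitude.
    print(f"{name:13s} ratio = {ratio:8.2f}   log10 = {math.log10(ratio):+.2f}")
```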



6 Conclusions 

This paper proposed an algorithm for computing input sequences intended for
simulation-based system-level power estimation of SOC designs.

The approach upgrades the test vectors given by designers with ad-hoc
vectors generated by a heuristic algorithm, which covers the system-level
specification much more extensively and places particular emphasis on the most
power-consuming parts.

116 M. Lajolo et al. 



[Figure: bar chart for the all-software implementation; legend: HC-TBG, Functional]

Fig. 4. Comparing energy figures



The algorithm can be applied from the early design phases, since it deals with
the system behavior only and neglects its architecture. Moreover, the sequences
it produces can be reused in the subsequent design phases, when a more detailed
description of the system is available, thus providing more accurate power
estimation figures.

We have presented experimental results that show a difference of 2 or 3 orders of 
magnitude on a reasonably complex case study, which confirms the usefulness of the 
methodology. 

The proposed methodology can be very useful for modeling much more
extensively the environment in which the system operates, since it takes into account
more input sequences than the designer is likely to think of.

Moreover, an automatic test bench generation approach can also be useful for
predicting the power dissipated in the chip during the SOC manufacturing test (the
test at the end of chip production), where the activity produced in the chip, and
hence the chip power consumption, can be much higher than during normal operation.



Abstract. Coarse-grain reconfigurable architectures promise to be more
adequate for computational tasks due to their better efficiency and higher speed.
Since the coarse granularity also implies a reduction in flexibility, a universal
architecture hardly seems feasible, especially considering low-power
applications such as mobile communication. Based on the KressArray
architecture family, a design-space exploration system is being implemented,
which supports the designer in finding an appropriate architecture featuring an
optimized performance / power trade-off for a given application domain. By
comparative analysis of the results of a number of different experimental
application-to-array mappings, the explorer system derives architectural
suggestions. This paper proposes the application of the exploration approach to
low-power KressArrays. Both the interconnect power dissipation and the
operator activity are taken into account.



1 Introduction 

Many of today’s application areas, e.g. mobile communication, require very high
performance as well as a certain flexibility. It has been shown that the classic ways
of realizing the associated algorithms, ASIC implementation and microprocessors, are
often not adequate, as microprocessors cannot provide the performance, while ASIC
implementations lack flexibility.

As a third way of implementation, reconfigurable computing has gained importance
in recent years. It is expected that, in order to obtain sufficient flexibility for high
production volumes, most future SoC implementations will need some percentage of
reconfigurable circuitry [1]. Reconfigurable computing even has the potential to
question the dominance of the microprocessor [2].

Early approaches in reconfigurable computing were based on Field-Programmable 
Gate Arrays (FPGAs), which provide programmability at bit-level. These solutions soon 
turned out to have some disadvantages for computing applications, as complex operators 
have to be composed from bit-level logic blocks. This leads to the following drawbacks, 
among others: 

• A large amount of configuration data is needed, which makes fast 
configuration hard and increases power dissipation during configuration. 

• As the operators are made of several logic blocks, regularity is mostly lost
and an extra routing overhead occurs. This extra routing also increases the
power dissipation of the circuit at run time.

To counter the disadvantages of FPGA-based solutions, coarse-grain reconfigurable
architectures have been developed for computational applications [3], [4], [5], [6],
[7], [8], [9], [10]. These devices are capable of implementing high-level operators in
their processing elements, featuring multiple-bit wide datapaths. Coarse-grain
reconfigurable architectures avoid several drawbacks of FPGAs. Featuring relatively
few powerful operators instead of many logic blocks, coarse-grain architectures need
much less configuration data. Also, as the processing elements can be implemented in
an optimized way, both the expected performance and the power dissipation are lower
than for FPGAs, as shown in [11]. However, some problems still have to be solved when
using a coarse granularity, in order to cope with the reduced flexibility compared to
FPGA-based solutions:

• Processing elements of coarse-grain architectures are more "expensive" than
the logic blocks of an FPGA. In FPGA mappings it is acceptable for a number of
logic blocks to remain unused or unreachable by the routing resources; in
coarse-grain architectures the latter situation in particular is costly, since
there are fewer processing elements and each consumes more area.

• While the multi-bit datapath applies well to operators from high-level
languages, operations working on smaller word lengths, and especially bit
manipulation operations, are either weakly supported or need a sophisticated
architecture. If such operations occur, e.g. in data compression
algorithms, they may result in a complex implementation requiring several
processing elements, or in difficult architectural requirements.

• Although the power consumption caused by the routing resources can be 
expected to be lower than for FPGAs [11], the problem of a careful 
architectural design of the interconnect network, which provides both low 
power and adequate flexibility remains. 

Due to these problems, the selection of architectural properties like datapath width,
routing resources and operator repertory is a general problem in the design of
coarse-grain architectures. Thus, a design space exploration is done in many cases to
determine suitable architectural properties. As the requirements on the architecture
depend mostly on the set of typical applications to be mapped onto it, previous
efforts normally use a set of example applications, e.g. DES encryption, DCT, or FIR
filters. Examples of such approaches are published in [9] and [10]. According to these
methods, coarse-grain architectures are often optimized for a specific application
area, like DSP or multimedia applications.

Fig. 1. Three levels of interconnect for KressArray architectures: a) nearest neighbor links,
b) backbuses in each row or column (for more examples see figure 2), c) serial global bus.

While most of these explorations focus on performance or area, power considerations
are often thrown over the wall from architectural exploration to physical design. With
regard to interconnect networks, however, Zhang et al. have presented a comparison and
analysis of reconfigurable interconnect architectures using energy as a metric in [12],
giving general results for DSP applications. We feel that optimized architectures can
also be found for more specific application domains, which are defined by a number of
sample applications.

To find a suitable architecture for a given domain, the KressArray Xplorer framework
is currently being implemented. The framework uses the KressArray [7] [8] architecture
family as the basis for an interactive exploration process. When a suitable architecture
has been found, the mapping of the application is provided directly. The designer is
supported during the exploration by suggestions from the system on how the current
architecture may be enhanced. This paper proposes the application of this framework to
power-aware architecture exploration and discusses related issues.

The rest of this paper is structured as follows: To give an overview on the topic, the 
next two sections briefly sketch the KressArray architecture family and the design space 
for the exploration process published elsewhere [13]. In section 4, our general approach 
for an interactive design space exploration for an application domain is presented. After 
this, a short overview on the KressArray Xplorer framework is given. The next section 
outlines our approach for the generation of design suggestions, which can be used to 
incorporate the models for power estimation presented in the following section. Finally, 
the paper is concluded. 



2 The KressArray Architecture Family 

The KressArray family is based on the original mesh-connected (no extra routing areas,
see figure 2 d, e) KressArray-1 (aka rDPA) architecture published elsewhere [8]. An
architecture of the KressArray family is a regular array of coarse-grain reconfigurable
DataPath Units (rDPUs), each featuring a multiple-bit datapath and providing a set of
coarse-grain operators. The original KressArray-1 architecture provided a datapath of 32
bits and all integer operators of C; the proposed system can also handle other datapath
widths and operator repertories. The different types of communication resources are
illustrated in figure 1. There are three levels of interconnect: First, an rDPU can be
connected via nearest neighbor links to its four neighbors to the north, east, south and
west. There are unidirectional and bidirectional links. The data transfer direction of the
bidirectional ones is determined at configuration time. Second, there may be backbuses in
each row or column, which connect several rDPUs. These buses may be segmented, forming
several independent buses. Third, all rDPUs are connected by one single global bus,
which allows only serial data transfers. This type of connection only makes sense for
coarse-grain architectures with a relatively low number of elements. However, a global
bus effectively avoids the situation that a mapping fails due to lack of routing resources.
The rDPUs themselves can serve as pure routing elements, as operators, or as operators
with additional routing paths going through. Some more communication architecture
examples are shown in figure 2.

The number and type of the routing resources as well as the operator repertory are
subject to change during the exploration process. Typically, a trade-off has to be found
between the estimated silicon area, the performance, and the power dissipation of the
architecture, where both performance and power dissipation will typically depend on the
application to be implemented.



3 The KressArray Design Space



The KressArray structure defines an architecture class rather than a single architecture. 
The class members differ mainly by the available communication resources and the oper- 
ator repertory. Both issues have obviously a considerable impact on the performance of 
the architecture. In the following, we define the design space for KressArray architec- 
tures based on the introduction given in section 2. 

The following aspects of a KressArray architecture are subject to the exploration
process and can be modified by the tools of the exploration framework:

• The size of the array.

• The operator repertory of the rDPUs.

• The available repertory of nearest neighbor connections. The numbers of
horizontal and vertical connections can be specified individually for each
side and in any combination of unidirectional or bidirectional links.

• The torus structure of the array. This can be specified separately for each
nearest neighbor connection. The possible options are no torus structure or
torus connection to the same, next or previous row or column respectively.

• The available repertory of row and column buses. Here, the number of buses
is specified as well as properties for each single bus: the number of
segments, the maximal number of writers, and the length of the first segment,
which allows buses having the same length but spanning different parts of the
array.

• Areas with different rDPU functionality. For example, a complex operator
may be available only in specific parts of the array. This also allows the
inclusion of special devices in the architecture, like embedded memories.
The operator repertory can be set for arbitrary areas of the array, using
generic patterns described by few parameters.

• The maximum length of routing paths for nearest neighbor connections,
which can be used to satisfy hard timing or power constraints.

• The number of routing paths through an rDPU. A routing path is a connection from an
input to an output through an rDPU, which is used to pass data to another rDPU.

• The interfacing architecture for the array. Basically, data words to and from
the KressArray can be transferred in one of three ways: over the serial
global bus, over the edges of the array, or over an rDPU inside the array,
where the latter possibility is mostly used for library elements.

Fig. 2. KressArray communication architecture by examples: a) 4 reconfigurable nearest
neighbor ports (rNN ports), b) 8 rNN ports, c) 10 rNN ports, d) reconfigurable Data Path
Unit (rDPU, compare fig. c), use for routing only; e) rDPU use for function and routing,
f) 2 global backbuses per row, g) segmented single backbuses per column, h) 2 buses per
column, 3 per row, i) different function sets in alternating columns.

In order to find a suitable set of these properties for a given application domain, an
interactive framework is currently being developed [13]. The framework, called KressArray
Xplorer, guides the user in designing a KressArray optimized for a specified problem.
At the end of the design process, a description of the resulting architecture is generated.
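
The design-space parameters listed above can be pictured as a small record type. The following Python sketch is only an illustration: the class names, field names, default values and the rule used to count the routing resources meeting at an rDPU are assumptions made here, not part of the Xplorer file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BusSpec:
    """One row or column bus (hypothetical field names)."""
    segments: int = 1              # number of independent segments
    max_writers: int = 1           # maximal number of writers
    first_segment_length: int = 0  # length of the first segment


@dataclass
class KressArraySpec:
    """Illustrative subset of the explored architecture properties."""
    width: int
    height: int
    datapath_bits: int = 32
    operators: Tuple[str, ...] = ("add", "sub", "mul")
    nn_links_per_side: int = 1
    row_buses: List[BusSpec] = field(default_factory=list)
    column_buses: List[BusSpec] = field(default_factory=list)
    max_nn_path_length: int = 4
    routing_paths_per_rdpu: int = 1
    has_global_bus: bool = True

    def rdpu_count(self) -> int:
        return self.width * self.height

    def routing_resources_per_rdpu(self) -> int:
        # Assumed counting rule: links on four sides, plus the buses
        # passing an rDPU, plus the global bus if present.
        return (4 * self.nn_links_per_side
                + len(self.row_buses)
                + len(self.column_buses)
                + (1 if self.has_global_bus else 0))


spec = KressArraySpec(width=4, height=4, row_buses=[BusSpec(segments=2)])
print(spec.rdpu_count(), spec.routing_resources_per_rdpu())
```

Keeping the description in one record makes it easy to copy an architecture, change a single property, and compare the variants during exploration.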

4 General Approach to Design Space Exploration 

The design flow of design space exploration for a domain of several applications is
illustrated by figure 3. In most cases a cycle through the loop takes only a few minutes, so
that a number of alternative architectural designs may be created in a reasonable time.
First, all applications are compiled into a representation in an intermediate format, which
contains the expression trees of the applications. All intermediate files are analyzed to
determine basic architecture requirements like the number of operators needed. The main
design space exploration cycle is interactive and meant to be performed on a single
application. This application is selected from the set of applications in such a way that
the optimized architecture for the selected application will also satisfy the other
applications. This selection is done by the user, with a suggestion from the system. For
low power exploration, two application properties can be considered: regularity and the
estimated power consumption without regarding the component of the routing architecture.
An approach to measure the regularity of an application has been published in [14]. The
power estimation can be derived from scheduled data flow graphs using a methodology
described in [15] and [16].



The exploration itself is an interactive process, which is supported by suggestions to
the designer on how the current architecture could be modified. The application is first
mapped onto the current architecture. The generated mapping is then analyzed and
statistical data is generated. Based on this data, suggestions for architecture improvement
are created using a fuzzy-logic based approach described below. The designer may then
choose to apply the suggestion, propose a different modification, return to a previous
design version, or end the exploration. Some modifications allow the new architecture to
be used directly for the next mapping step, while others will require a re-evaluation of the
basic architecture requirements and/or a re-selection of the application for the
exploration. In particular, a change of the operator repertory for the rDPUs requires the
replacement of subtrees in the dataflow graph, thus affecting the number of required rDPUs
in the array, the complexity of all applications, and the power consumption. Thus, for a
change of the operator repertory, a re-evaluation is required.
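
The exploration cycle described above can be summarized as a loop. The Python sketch below is a simplified illustration: the function name `explore`, the callback names and the three action labels are hypothetical, and the mapper, analyzer and designer are passed in as plain callables rather than modeled in detail.

```python
def explore(architecture, application, map_fn, analyze_fn, suggest_fn,
            choose_fn, max_iterations=20):
    """Iteratively refine an architecture (hypothetical interface).

    choose_fn stands in for the designer and returns one of
    ("apply", new_architecture), ("undo", None) or ("end", None).
    """
    history = [architecture]
    for _ in range(max_iterations):
        current = history[-1]
        mapping = map_fn(current, application)   # map application
        stats = analyze_fn(mapping)              # gather statistics
        suggestions = suggest_fn(stats)          # derive suggestions
        action, new_architecture = choose_fn(current, suggestions)
        if action == "end":
            break
        if action == "undo":
            if len(history) > 1:
                history.pop()                    # back to previous version
        else:
            history.append(new_architecture)     # apply a modification
    return history[-1]


# Toy usage: keep adding buses until nothing is left unrouted.
result = explore(
    {"buses": 0},
    "fir",
    map_fn=lambda arch, app: {"unrouted": max(0, 3 - arch["buses"])},
    analyze_fn=lambda mapping: mapping,
    suggest_fn=lambda stats: ["add_bus"] if stats["unrouted"] else [],
    choose_fn=lambda arch, sug: (
        ("apply", {"buses": arch["buses"] + 1}) if sug else ("end", None)),
)
print(result)
```

A change of the operator repertory would additionally trigger the re-evaluation step described above, which this sketch leaves out.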




Fig. 3. Global approach for domain- 
specific architecture optimization. 



After the exploration cycle has ended, the final archi- 
tecture has to be verified by mapping the remaining 
applications onto it. On the one hand, this step pro- 
duces mappings of all applications to be used for 
implementation, while on the other hand, it is 
checked if the architecture will satisfy the require- 
ments of the whole application domain. 

5 The KressArray Xplorer 

This section will give a brief description of the 
components of the KressArray Xplorer, which has 
been published elsewhere [13]. An overview on the 
Xplorer is shown in figure 4. The framework is based 
on a design system, which can handle multiple 
KressArray architectures within a short time. 

It consists of a compiler for the high-level language ALE-X, a scheduler for performance
estimation, and a simulated-annealing based mapper. This system works on an intermediate
file format, which contains the net list of the application, delay parameters for
performance estimation, the architecture description, and the mapping information. The
latter is added by the mapper at the end of the synthesis process. An architecture
estimator determines the minimum architecture requirements in terms of operator numbers
and suggests the application with the expected worst requirements on power and routing
resources to the user through an interactive graphical user interface. This interface is
also generally used to control all other tools. Further, it contains two interactive
editors: an architecture editor, which allows the architecture to be changed independently
of the design suggestions, and a mapping editor, which allows the result of the mapper to
be fine-tuned. An analyzer generates suggestions for architecture improvements from
information gathered directly from the mapping and other sources. An instruction mapper
allows the operator repertory to be changed by exchanging complex operators in the
expression tree with multi-operator implementations. A simulator allows both simulation of
the application on the architecture and generation of a behavioral HDL (currently Verilog)
description of the KressArray. Finally, a Module Generator (planned) should generate the
final layout of a KressArray cell. Throughput estimations are generated from Scheduler
results and statistical data. An area estimation can also be generated easily from Module
Generator parameters.

Fig. 4. Xplorer components including proposed Power Estimation.



6 Generation of Design Suggestions 

In this section, we give a short overview of our approach to the analysis and generation
of design suggestions, which is performed by the analyzer tool of the Xplorer framework.
The problem of generating feedback on a given design can be split into several
subproblems:

• Analysis of the current architecture and gathering of information. This
includes the combination of data to gain derived information.

• Generation of suggestions from this information.

• Ranking of the suggestions by their importance.

In our approach, the basis for the information gathering step is the mapping produced 
by the design system. Primary data gathered includes the usage of buses and nearest- 
neighbor connections in percent, the number of serial bus connections, the estimated 
number of execution cycles, and others. This primary data can be used to derive second- 
ary information, like the estimated power dissipation, based on the mapping and the 
actual usage of routing resources by the application. 

To allow the gathering of a variety of information from the mapping, the analyzer 
tool is implemented using a plug-in-based approach, which provides also the flexibility 
to extend the system. The plug-ins are controlled by the analyzer tool, which holds also 
data structures for the analysis results. 

The design suggestions themselves are then generated using an approximate reasoning
approach based on fuzzy logic [17], [18]. The knowledge of how to guide the exploration
process is expressed in implication rules, as known from expert systems. The fuzzy
approach has been chosen to allow the generation of suggestions based on inexact
information, since during the exploration process it is assumed that the designer will
apply a 'fast' mapping (by tuning the parameters of the simulated annealing), which will
probably result in a mapping of lower quality.
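
To make the idea of fuzzy implication rules concrete, the following Python sketch evaluates three rules on the primary data mentioned above. It is only an illustration under assumptions: the membership function, the thresholds, the rule texts and the use of the minimum as fuzzy AND are chosen here for the example and are not taken from the analyzer.

```python
def ramp(x, low, high):
    """Membership degree rising linearly from 0 at low to 1 at high."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def suggest(stats):
    """Return (suggestion, degree) pairs ranked by degree of truth."""
    nn_high = ramp(stats["nn_usage_percent"], 60.0, 90.0)
    bus_high = ramp(stats["bus_usage_percent"], 60.0, 90.0)
    serial_many = ramp(stats["serial_bus_connections"], 2.0, 10.0)

    # Each rule: IF premise_1 AND premise_2 THEN suggestion,
    # with AND evaluated as the minimum of the membership degrees.
    rules = [
        ("add nearest neighbor links", min(nn_high, serial_many)),
        ("add a row or column bus", min(bus_high, serial_many)),
        ("remove a row or column bus",
         min(1.0 - bus_high, 1.0 - serial_many)),
    ]
    fired = [rule for rule in rules if rule[1] > 0.0]
    return sorted(fired, key=lambda rule: rule[1], reverse=True)


ranked = suggest({"nn_usage_percent": 90.0,
                  "bus_usage_percent": 66.0,
                  "serial_bus_connections": 10})
for text, degree in ranked:
    print(f"{degree:.2f}  {text}")
```

Because the degrees vary smoothly with the statistics, a slightly worse 'fast' mapping shifts the ranking only gradually instead of flipping a hard threshold.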



7 Power Estimation for the Exploration Process 

In this section, we propose an approach to generate a power estimation. Given
an application and a KressArray architecture onto which this application has been
mapped, the total power consumption is composed of the power consumed by the
operators and the power consumed by the interconnect network. We will now discuss
how to estimate these in a way adequate for the exploration process. Note that for the
exploration process, a very accurate estimation of the power consumption is not
necessary. Instead, we propose simplified measures, which allow a relative quantitative
comparison of the power dissipation of different architectures.

The operator component is determined by the operator repertory (taken from the 
intermediate form seen in figure 4), by the implementation of the operators (to be derived 
from Module Generator parameters), by the configuration (indicating which operator has 
been selected) and by the switching activity (to be extracted from the intermediate form 
(or from the HDL description) and the Scheduler results: see figure 4). 

In our Xplorer framework, we assume the operator repertory to be organized in several
sets, which can be switched by the instruction set mapper mentioned in section 5,
and that corresponding power models for the final implementation of those sets are
available from the library of the Module Generator (figure 4). With these preliminaries,
the power consumption for a given application can be estimated by techniques like those
presented in [15], [16], since the required data flow graph can be extracted from the
intermediate format (see figure 4). The resulting estimation does not consider the routing
architecture or an actual mapping of the application at all (however, see next paragraph).
It can nevertheless be used to distinguish applications in order to select the one for the
exploration process described in section 4. This selection takes place at the beginning
and whenever the operator set has been changed.

The power consumption caused by the routing network also depends on the actual
use of the routing resources during run-time. However, for our purposes, we can employ
a more relaxed metric for the different routing resources of KressArrays. Instead of the
power, we use the energy of the routing resources as a measure, neglecting the actual
usage during execution. Generally, for the energy consumption E of an interconnect
structure, the following estimation can be used (a modified version of the model in [12])
for a net con starting from an operator op in the mapping:

$E(con) = \left( \sum C_W + \sum C_S + Fanout_{op} \cdot C_L \right) \cdot V^2$ (1)

where $C_W$ and $C_S$ are the corresponding capacitances of the wire segments and
switches for each routing resource, $Fanout_{op}$ is the fan-out of the operator forming the
source of the net, $C_L$ is the load capacitance of an operator input, and $V$ is the supply
voltage. For our purposes, we can simplify this equation to:

$E(con) = K_L \cdot L + K_S \cdot NR \cdot S + Fanout_{op} \cdot K_F \cdot NR$ (2)

where $K_L$ is the energy per wire segment, $L$ is the number of segments used, $K_S$ is the
energy per switch, $NR$ is the number of all routing resources meeting at an rDPU, which
determines the width of the required switch, $S$ is the number of switches, and $K_F$ is the
energy per fan-out connection. The values $K_L$, $K_S$, and $K_F$ can be assumed to be
constant and known for each interconnect type. Then, we get the following estimates for the
energy for each routing resource:

Global Bus. For this type of connection, the capacitance of the whole bus structure
has to be considered, as this bus is not switched, but operates in a serial manner. The
capacitance depends on the array size and the actual layout of the bus structure. For a
relative measure between different architectures, we can assume a general layout like in
figure 1c. If the sizes of the array in x- and y-direction are denoted as $AS_x$ and $AS_y$
respectively, the number of bus segments is approximated by
$L = AS_x \cdot (AS_y + 1)$, which we simplify to $L = AS_x \cdot AS_y$ to keep the measure
independent of the aspect ratio of the array. Thus, we get:

$E_{global}(con) = K_L \cdot AS_x \cdot AS_y + Fanout_{op} \cdot K_F \cdot NR$ (3)

for a global bus connection con with source operator op.

Row/Column backbuses. For those connections, both source and one or more
sinks lie on one bus segment, which in itself is not switched. Thus, we get:

$E_{backbus}(con) = K_L \cdot Segment(op) + Fanout_{op} \cdot K_F \cdot NR$ (4)

with Segment(op) denoting the length of the bus segment holding the source operation op.

Nearest Neighbor Connects. Using the nearest neighbor connects, a connection from one
source operator to several sinks can be implemented, resulting in a path composed of
several subpaths (cf. figure 5). There is a subpath for each sink, composed of a sequence
of length-1 connections, whereby different subpaths do not share such segments and each
rDPU lying on this sequence inflicts additional switching energy. We get the following
estimation for a path made up of $Fanout_{op}$ subpaths, each with its own subpath length
$SPL_i$:

$E_{NN}(con) = \sum_{i=1}^{Fanout_{op}} \left( SPL_i \cdot K_L + (SPL_i - 1) \cdot K_S \cdot NR + K_F \cdot NR \right)$ (5)

Fig. 5. Path composed of three subpaths

According to the software architecture of the framework outlined in section 5, the
implementation of these estimation functions can easily be done as plug-ins, which use
available data from the intermediate file, provided the hardware parameters are given.
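
As an illustration of how such estimation plug-ins might look, the following Python sketch implements the three per-resource energy estimates for the global bus, the backbuses and the nearest neighbor connects. The function and parameter names are assumptions made for this sketch; `k_l`, `k_s` and `k_f` stand for the energy per wire segment, per switch and per fan-out connection, and `nr` for the number of routing resources meeting at an rDPU. The numeric values in the usage lines are arbitrary.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EnergyParams:
    k_l: float  # energy per wire segment
    k_s: float  # energy per switch
    k_f: float  # energy per fan-out connection


def energy_global(p: EnergyParams, size_x: int, size_y: int,
                  fanout: int, nr: int) -> float:
    # The whole bus is charged: size_x * size_y segments.
    return p.k_l * size_x * size_y + fanout * p.k_f * nr


def energy_backbus(p: EnergyParams, segment_length: int,
                   fanout: int, nr: int) -> float:
    # Only the (unswitched) bus segment holding the source counts.
    return p.k_l * segment_length + fanout * p.k_f * nr


def energy_nearest_neighbor(p: EnergyParams,
                            subpath_lengths: Sequence[int],
                            nr: int) -> float:
    # One subpath per sink; a subpath of length spl crosses
    # spl - 1 intermediate rDPUs acting as switches.
    return sum(spl * p.k_l + (spl - 1) * p.k_s * nr + p.k_f * nr
               for spl in subpath_lengths)


params = EnergyParams(k_l=1.0, k_s=2.0, k_f=0.5)
print(energy_global(params, 4, 4, fanout=2, nr=6))
print(energy_backbus(params, 4, fanout=2, nr=6))
print(energy_nearest_neighbor(params, [1, 2], nr=6))
```

Since only relative comparisons between architectures are needed, the three constants can be given in arbitrary but consistent units.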

8 Conclusions 

An interactive approach for the design-space exploration of mesh-based reconfigurable
architectures from the KressArray family has been presented. A corresponding framework
called KressArray Xplorer is based on a design system which allows the specification of
the input in a high-level language. During the exploration process, which is
based on iterative refinement of the current architecture, the designer is supported by
suggestions on how the current solution may be improved. To apply the framework to the
exploration of low power architectures, corresponding models have been proposed, which
can easily be integrated into the framework.

Abstract. Based on a previously developed concept of equivalent capacitance,
we present a novel analytical linear representation of internal power
dissipation components in CMOS structures. An extension to gates is proposed
using an equivalent inverter representation, deduced from the evaluation of an
equivalent transistor for serial transistor arrays. Validation of this model is
given by comparing the calculated results to the simulated values (using
foundry model cards), under different design conditions, implemented in
0.25 µm and 0.18 µm CMOS processes. Application is given to delay and power
optimisation of buffers and paths.



1 Introduction 

Careful study of the power dissipation in submicron CMOS structures shows that the
internal component may have contributions greater than necessary to control the
different gates. It becomes then essential to get an accurate and design oriented power
estimation of static CMOS family for the design and optimisation of high
performance circuits. As a result many authors developed accurate models of power
dissipation dedicated to submicron technologies. In [1] a short circuit power
consumption formula is derived from a piece wise linear representation of the short
circuit current. A complex modelling of the output waveform permits an accurate
evaluation of the short circuit dissipation in [2]. A macro-model of short circuit power
consumption has been deduced from a detailed delay analysis in [3].

Unfortunately, these models are too complicated to define low power design
criteria at cell level. We present here a design-oriented model of the internal power
dissipation component based on a previously developed concept of equivalent
capacitance. The defined target is to allow direct comparison between the components
commonly considered as significant for CMOS circuits: the external dynamic
component, associated to the gate output capacitance charge and discharge, and the
internal dynamic component due to the short circuit occurring between N and P
blocks and to the overshoot discharge resulting from the input to output coupling.

In order to define low power design criteria, at cell level, applicable to buffer
design, we propose a design-oriented modelling of the short circuit power component.
The major focus here will be to define the propitious design conditions minimizing
the internal power component with respect to the external (capacitive) one. The
internal power consumption model is described in section 2. A linear representation
of this model is proposed in section 3 where we give validations of this approach
targeting 0.25 µm and 0.18 µm processes. In section 4 we model the serial transistor
array through an equivalent transistor that we define. This allows to represent each
gate by an equivalent inverter for which we can predict the internal power
consumption. In section 5, we define sizing criteria for power minimisation.
Application to buffer sizing and path optimisation is given in section 6. Conclusion is
drawn in section 7.



2 Power Consumption Model Description 



Using the equivalent capacitance concept proposed in [16], we can express the
internal power consumption as follows:

$P_{INT} = \eta \cdot f \cdot (C_{SC} + C_{OV}) \cdot V_{DD}^2$ (1)

where $\eta$, $f$, and $V_{DD}$ are respectively the activity rate, the switching frequency and the
supply voltage; $C_{SC}$ and $C_{OV}$ are the equivalent capacitances that would generate the
same power dissipation as the short circuit while reported on the output node. The
evaluation of $C_{OV}$ is done using the expression proposed in [3]. The short circuit
equivalent capacitance $C_{SC}$ is expressed as:

To evaluate $C_{SC}$ we can, as in [2], directly perform the integration between $t_0$ and $t_1$,
which are respectively the beginning and the end of the short circuit. We can also [4]
use symmetrical properties of the short circuit current (Fig. 1). This allows the
integration to be performed only between $t_0$ and $t_{SP}$, which corresponds to the occurrence
of the maximum short circuit current. We can also, as in [1], assume that the short circuit
current presents a linear variation between $t_0$ and $t_{SP}$; the integration is then reduced
to the evaluation of a triangular surface. The height and the base of the triangle are
respectively $I_{SC-MAX} = I_{SC}(t_{SP})$ and $2 \cdot (t_{SP} - t_0)$, which represents the short circuit
duration. This leads to the following expression of the short circuit equivalent
capacitance:

$C_{SCLH} = \frac{t_{SP} - t_{OVLH}}{V_{DD}} \cdot I_{SC-MAX}$ (3)

where $t_{OVLH} = t_0$ corresponds to the end of the overshoot discharge [3]. Thus, the main
difficulty here is to accurately evaluate the values of $t_{SP}$ and $I_{SC-MAX}$. In earlier work,
these values have been calculated from the switching delays of the structure considering
that the maximum current occurs when the operating mode of the short circuiting transistor
evolves from linear to saturated mode. While this hypothesis has been sufficient for a
0.7 µm process, it appears that with deep submicron processes (0.25 µm and less) the
maximum current occurs in linear mode and is modulated by desaturation effects of
carrier speed [4]. With such considerations, $t_{SP}$ is evaluated from a temporal derivation
of the short circuit current. To perform this derivation we assume, according to [6],
that around $t_{SP}$ the variation of the output voltage is linear (Eq. 4, 5); as a consequence
the output slope duration is proportional to the step response of the inverter.
Under this assumption the drain to source and gate to source voltage values necessary
to determine the current evolution can be easily calculated.

Canceling the derivative of the current expression with respect to time gives directly
the expression of $t_{SP}$ as:

With this value of $t_{SP}$ the equivalent short circuit capacitance is deduced from (3),
considering a linear variation of the voltages around $t_{SP}$ (Eq. 4, 5):

(7)





Fig. 1. Evolution of the output voltage and
current in a CMOS inverter; the short
circuit current is triangular in shape



Fig. 2. Simulated and calculated values of
the equivalent internal capacitance for an
inverter (W_N = 1 µm) designed in a 0.18 µm
process, with different configuration ratio values k
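
The triangular approximation of the short circuit current can be checked numerically. The Python sketch below is an illustration under stated assumptions: the function and argument names are chosen here, the charge is taken as the area of a triangle whose base is twice the interval between the start of the short circuit and the current peak, and the numeric values are arbitrary examples rather than process data.

```python
def short_circuit_capacitance(t_start, t_peak, i_max, v_dd):
    """Equivalent capacitance of a triangular short circuit current.

    The triangle has height i_max and base 2 * (t_peak - t_start);
    its area is the transferred charge, divided by v_dd to obtain
    the capacitance that would dissipate the same energy.
    """
    duration = 2.0 * (t_peak - t_start)
    charge = 0.5 * duration * i_max
    return charge / v_dd


def internal_power(activity, frequency, c_sc, c_ov, v_dd):
    """Internal power from the two equivalent capacitances."""
    return activity * frequency * (c_sc + c_ov) * v_dd ** 2


# Example values (assumed, not taken from the paper).
c_sc = short_circuit_capacitance(t_start=0.0, t_peak=100e-12,
                                 i_max=50e-6, v_dd=2.5)
p_int = internal_power(activity=0.5, frequency=100e6,
                       c_sc=c_sc, c_ov=1e-15, v_dd=2.5)
print(c_sc, p_int)
```

Expressing the short circuit as a capacitance on the output node allows it to be compared directly with the load capacitance of the gate.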



These equations have been used to calculate the equivalent internal capacitance
values of inverters designed in 0.25 µm and 0.18 µm processes with various conditions
of control, loads, sizes and configuration ratio. Validations have been obtained by
comparing these values to values simulated from HSPICE simulations using
Kang's method [7,8]. As shown in Fig. 2 the agreement between simulated and
calculated values over the considered design range is good enough to validate the
proposed method. Currently observed discrepancies are lower than 13%.

3 Linear Representation



3.1 Linearization 



The expression (7) is a product of two terms. The first one is the short circuit duration
and the second one is the maximum short circuit current. Thus, with the aim of
simplifying (7), let us first consider the short circuit duration. To first order, the
overshoot duration is given by:

^OVLH ~'^TNL INLH 

Therefore the short circuit duration can be expressed as a linear function of the input 
slope duration 

~^OVLH^ ~ ~^TPN INLH 

Let us now consider the maximum short circuit current. Neglecting the small variation
of the gate to source voltage with respect to (10), we note that the maximum
short circuit current is proportional to the transistor width. Its evolution with the input
slope can be represented by a linear function.

This representation involves the inverter response to a step input [6], and a and b are
single coefficients determined for all the inverters of the library.
They can be obtained directly from (7) or calibrated on HSPICE simulations. For a
0.25pm CMOS process we obtain aj^^^ = 7.5.10' , bjj^^ = 3.75.10' , for an input falling 
edge, a^^jj = 2.25.10' , bj^jj = 3.10' for an input rising edge. 



3.2 Validation 

Validation of the linear representation given by Eq. 12 is illustrated in Fig. 3. In this
figure we compare, for a 0.18 µm CMOS process, simulated and calculated values of
the equivalent capacitance representing the total internal power dissipation
component. As shown, we obtain a good agreement between calculated and simulated
values over the whole considered design range. Commonly, discrepancies are lower than
15%.

Fig. 3. Simulated values of the equivalent capacitance of an
inverter (W_N = 4 µm) loaded by C_L = 10·C_IN




Fig. 4. Simulated (lines) and calculated 
(with the linear representation) values of 
the internal power dissipation of the 
inverter defined in Fig. 2. 



3.3 Comparison with Previous Work 

In Table 1 we give a more complete validation by comparing previously proposed
formulas [4,9,10] to the macro-model presented here.

Table 1. Comparison of internal power dissipation formulae with Spice simulation results for
an inverter (C_IN = 20.7 fF, loaded by 145 fF, L = 0.25 µm) for various input slope conditions

…                      5,65     5,31     5,48
16    9,15     7,53    10,48    8,36     7,80
20    11,43    9,41    16,05    12,06    10,34



Eq. 10 of Veendrick has been updated for submicron processes and represents the
zero-load model. Eq. 11 of [9], presented by Sakurai and based on the α-power model, uses
the zero load assumption. Eq. 12 from [10] is also based on the α-power model but
does not consider the overshoot component. As shown in Table 1, the proposed
expression (Eq. 11), including the linear representation of overshoot and short circuit
components, remains accurate over a large range of input slew.



4 Extension to Gates 



The main difference between an inverter and a logic gate is the presence of a serial
array of transistors in the N or the P block. To evaluate the internal power dissipated
in CMOS gates, we represent each gate by an equivalent inverter defined for each
possible switching condition. As an example let us consider a Nand2 where two
switching conditions may be defined depending on the input control data (Fig. 5).
Considering the top input of the serial array as the controlling one, it is clear that its
driving gate strength is reduced by the voltage drop across the bottom transistor
working in the linear mode (Fig. 6).



This defines two reduction factors, for falling and rising input controlling edges
respectively, for which the top transistor is working in linear or saturated mode, as:



where R is the drain source resistance of a unit transistor (width = 1 µm) and K_NSAT the
conduction coefficient of the N transistor working in saturation. Considering the
bottom transistor, due to the voltage level degradation occurring on the conducting
top N transistor, its working mode is the same as that of the N transistor of an
inverter working under a lower supply voltage, reduced by the voltage drop across the
top transistor (Fig. 7). In this condition the overshoot duration increases and the short
circuit duration will be lower than that of an inverter. Consequently, the equivalent
short circuit value will be modified through the supply voltage reduction (V_DD minus
this voltage drop instead of V_DD).




Fig. 5. Two input Nand structure 



= - Wtop-R-^ovt^sp) 1 ^jvsAr • ^TOP • ^ 



NSAT 




Fig. 6. Resistive behavior of the
bottom transistor of a Nand3 gate when
only the top input has switched



Fig. 7. Voltage drop across the top 
transistor of a Nand2 gate when the bottom 
input is switching 
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For the Nand2 it is then possible to express the equivalent short circuit capacitance 
for a rising input as: 

( 17 \ C _ b 
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(15) 



with an equivalent expression for the falling edge. Here the voltage term is the voltage
drop across the top transistor, and the remaining coefficients are process parameters.
As for inverters, we compare in Table 2 simulated and calculated values of the internal
capacitances. As shown, the observed discrepancy does not exceed 15%.



Table 2. Comparison between simulated and calculated internal capacitance for a Nand2
(W_N = 4 µm, W_P = 12 µm, C_L = 198 fF, L_geo = 0.25 µm) when only the bottom input has switched

12      5.33      5.64      6%
15      7.24      8.33      15%



5 Sizing Criteria for Power Minimization 

5.1 Buffer Sizing 

Since each gate can be reduced to an equivalent inverter, minimizing the
power dissipated along a combinatorial path amounts to optimizing an inverter chain.
Our concern here is thus to find the sizing of each inverter which minimizes the
power. To simplify the problem, let us define the smallest inverter chain that can be
optimized for power. Although the external power dissipated by an inverter is imposed
by its input gate capacitance, its internal contribution depends on the preceding stage
(the input slew) and the following stage (the load). Sizing the inverter is therefore a
true input-output slew control problem.



Fig. 8. Smallest inverter chain to be
optimized for power



Fig. 9. Illustration of the variations, with
respect to the size of the driver (i), of the
internal and total power components
necessary to drive a composite load

Usual design alternatives are expressed with respect to load and fan-out factors.
Optimizing the smallest inverter chain therefore amounts to finding the size of
stage (i), which determines the fan-out factor between stages (i-1) and (i),
that minimizes the power dissipation. This is illustrated in Fig. 9, which represents the
variation of the internal and total power dissipation on the preceding array versus the
size of inverter (i).

The first curve represents the equivalent capacitance relative to the internal power
dissipation on the array; the second is the sum of this capacitance and of the total
array input capacitance. As shown, away from the minimum, a bad selection of the buffer
(too small or too large) results in unnecessary extra power dissipation.

Using Eq. 10, it is quite easy to evaluate the internal power component of the array
given in Fig. 9 and to search for the condition on the intermediate inverter size
minimizing this component. The summation of all the contributions results in a
6th-order polynomial expression that can only be solved graphically or numerically.
Neglecting in Eq. 10 the a_HL contributions results in a 3rd-order expression which
can be solved as:

The validation of Eq. (16) and (17) has been done by comparing the optimal values of
the buffer input capacitance obtained from Hspice to the values predicted by the
model. As shown in Fig. 10, we obtain a good agreement between simulated and
calculated buffer size.




Fig. 10. Comparisons between calculated and simulated of W(i) optimal values for two cases 
(L<,go=0.25pm) : Casel : W(i-l) =lpm = 2.23 C|^/C(i-rl)=5 C,,=C3=0 Case2 : W(i-l)=lpm, 
k(i)=3 CL/C(U1)=10 C,,=10fF C3=2.C(i-t-'l) 



5.2 Path Sizing 

The optimization criteria for delay and power proposed in [11] and in Eq. 16,
respectively, are mostly non-linear. Under these conditions, global optimisation of a
combinatorial path is impractical for a logical depth greater than 5. We propose
here to apply a local optimisation in which, processing the selected path from
output to input, we optimize each element with its real load but driven by a reference
inverter. This procedure has been shown [11] to be efficient and fast enough to be
applied to this optimisation problem.




Fig. 11. Circuit used to illustrate the optimisation procedure: Xj represents the inverters (gates) 
to be sized, the capacitance values on the different nodes represent the parasitic loads 

Results are quasi-optimal, delay constraints being managed by sizing the global
reference. For illustration we applied this procedure to the example given in Fig. 11.
First we implement an initial solution where all the transistors have the minimum
size. Then we identify the critical path (Xj to X^in this example). For a given fan out 
factor of the last stage of this critical path, we size all the stages of this path, 
processing backward according to the following rule: the sizing for minimizing the 
power is obtained in the same way applying the sizing Eq. 16. 

Non-critical paths are sized under the same criteria by varying the reference
capacitance (C_REF) value in order to satisfy the delay defined by the critical path.
This procedure has been applied to the example shown in Fig. 11. The results obtained
have been compared to those obtained using a regular sizing (fan-out = constant) and a
uniform sizing (all transistors are identical), when imposing a delay constraint equal
to 460 ps. These comparisons are summarized in Table 3, where we report the values of
the critical path delay, the total power dissipation, the power-delay product (PDP), the
power-delay-surface product (PDPS), the total transistor width, used as an indicator of
the total active area, and the output slope.



Table 3. Comparison between our sizing, the uniform sizing and the regular sizing (fan-out = constant)

SIZING                REGULAR    UNIFORM    OUR METHOD
Delay (ps)            450        471        456
Total power (µW)      200        224        187
ΣW (µm)               31         30         23
P.D.P. (fJ)
P.D.P.S. (pJ·µm)      2.7        3
Slope out (ps)        114        239        153



6 Conclusion

Based on a concept of equivalent capacitance, we have demonstrated the possibility of
characterizing the internal power dissipation of CMOS inverters through accurate
design-oriented expressions. A clear indication of the input controlling slopes, output
loads and gate structure may help designers in defining fan-out control as a design
strategy for minimizing internal power dissipation. An extension to general gates has
been proposed through an algorithm for gate reduction to an equivalent inverter.

The equivalent capacitance concept makes it possible to compare directly
the different power dissipation components in terms of fan-out factors that can be
obtained at the circuit level and used to drive optimisation alternatives.

Validations have been performed by comparing calculated and simulated
(HSPICE) values (0.25-0.18 µm CMOS process) over a wide range of input control
slew. The interest of this model is in defining transistor resizing rules for power
optimisation. An application has been given to power optimisation under a delay
constraint.



Abstract. To be able to predict the importance of glitches in future
deep-submicron processes with lowered supply and threshold voltages,
a study has been conducted on designs which experience glitching, at
supply voltages in the range from 3.5 V to 1.0 V. The results show that
the dynamic power consumption caused by glitches will, in comparison to
the dynamic power consumption of transitions, be at least as important
in the future as it is today.



1 Introduction 

Glitches are unnecessary signal transitions which do not contribute any infor- 
mation or functionality. The glitches can be divided into two different groups: 
generated and propagated. A generated glitch can occur if the input signals to 
a gate are skewed. If a generated glitch occurs at the input of a gate, the glitch 
may propagate through the gate; in that case we have a propagated glitch. 

The number of glitches in a circuit depends mainly on the logic depth, gate 
fan-outs and how well the delays in the circuit are balanced. One obvious way to 
reduce glitching is to introduce pipelining, which would reduce the logic depth 
at the cost of power from pipeline registers. In circuits with large logic depths, 
the power consumption caused by glitches can be severe. In a non-pipelined
16 × 16-bit array multiplier, 75% of the switching activity in the circuit is glitches [1].

At almost all levels of abstraction, from the circuit level to the behavioral, 
techniques have been suggested to reduce the power consumption caused by 
glitches. At the circuit level, one popular way of reducing the power consumption 
is path balancing, where gates are resized and buffers are inserted to equalize 
the delays to the gates [2]. Restructuring multiplexer networks and clocking of 
control signals are techniques that can be used at the register-transfer level [3] . 

In future processes, the supply voltage has to be scaled even more than today 
to accommodate the demands for a lower power consumption. Other driving 
forces for supply voltage reduction are reduced channel lengths and reliability 
of gate dielectrics [4]. To retain the performance of the circuits, the threshold 
voltage, Vt, has to be scaled accordingly. However, a 100 mV decrease in Vt 
will increase the leakage current 10 times (@85°C). Therefore, the scaling of Vt 
is done at a slower pace and it might stop at a Vt of approximately 0.2 V [5]. 



D. Soudris, P. Pirsch, and E. Barke (Eds.): PATMOS 2000, LNCS 1918, pp. 139-148, 2000.
© Springer-Verlag Berlin Heidelberg 2000

In this paper we will examine the swing distribution and dynamic power consumption
of glitches when the supply voltage is lowered. Two different scenarios
are considered; in the first scenario the threshold voltage is kept constant when
the supply voltage is lowered. In the second scenario, the threshold voltage is
scaled proportionally to the supply-voltage scaling.

2 Simulations 

To see what is happening to the glitches when the supply voltage is lowered, some 
circuits, which experience a lot of glitching, i.e. adders and multipliers, have been 
simulated in HSpice™ and the behavior of the glitches has been studied. 

In order to keep track of the glitches in a circuit, a C program, which detects 
glitches in the transient file from HSpice™, has been used. The output from the 
C program has been processed and analyzed using Matlab. 

2.1 Circuit Selection 

A large number of glitches are needed in the simulated circuits in order to make
the analysis meaningful. Both adders and multipliers are known to experience
a lot of glitching [1,6]; therefore, one 8-bit adder and two array multipliers of
different sizes have been implemented in layout and extracted netlists have been
used in the simulations. The AMS 0.35 µm process has been used in the
implementations and all transistors in the designs are minimum sized.

The 8-bit adder is an ordinary ripple-carry adder (RCA8) and it is implemented
in static CMOS, in this case as the mirror adder [6]. The multipliers are
one 4 × 4-bit and one 8 × 8-bit array multiplier; both have been implemented as
carry-propagate multipliers [6].

2.2 The Glitch-Detection Program 

The program, which detects and calculates the power consumption of glitches 
and transitions, is written in C. The user of the program has to specify in 
which nodes the program should search for glitches and transitions. In the power 
calculations, only the dynamic power consumption is considered, i.e. short-circuit 
and leakage power consumptions are neglected. Neglecting the leakage current 
is still valid today, but in future processes, the leakage power will increase its 
importance significantly. 

In our circuit analysis, we have chosen to study the nodes in which the 
transitions can have rail-to-rail swing, i.e. nodes that are situated between an 
NMOS and a PMOS net. The intermediate nodes between transistors inside an 
NMOS or a PMOS net have been ignored to reduce the simulated data. 

The program uses the transient file and the capacitance table from HSpice™
to find and compute the power consumption of glitches and transitions. The
power consumption of a glitch (corresponding to two transitions) is calculated
as

$$P = f_{clk} \cdot C \cdot V_{DD} \cdot \Delta V \qquad (1)$$

where f_clk is the clock frequency, C is the capacitance for the node as given by
HSpice™, Vdd is the supply voltage, and ΔV is the swing of the node.

The program keeps track of the nodes specified in the setup and checks if the
node voltage has changed by more than a predetermined glitch amplitude value, e.g.
10% of Vdd. If the voltage change becomes larger than the glitch amplitude value we
either have a transition or a glitch. If the voltage level returns to within the
threshold within a clock period we have a glitch, otherwise we have a transition.
After we have registered a glitch we have the possibility to register another glitch
or a transition. In Fig. 1, a glitch followed by a transition in a node of the
simulated 8 × 8-bit multiplier is shown, Vdd = 2.8 V.




Fig. 1. A glitch and a transition during the same clock cycle, f_clk = 100 MHz



The program outputs the start and stop time and the maximum amplitude 
of all glitches it has found, together with the node to which the glitch belongs. 
We also get information from the program about how much of the power con- 
sumption originates from glitches and how much originates from transitions. 
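
The detection rule described in this section can be sketched as follows. This is only an illustrative Python sketch, not the authors' C program: the names `Event`, `classify_events` and `dynamic_power`, the sampled-waveform input format and the way the settled level is updated after a transition are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str       # "glitch" or "transition"
    t_start: float  # time at which the excursion exceeded the threshold
    t_stop: float   # time at which it was classified
    swing: float    # peak deviation from the starting level, in volts


def classify_events(times, volts, vdd, t_clk, amp_frac=0.10):
    """Label excursions of one node waveform as glitches or transitions.

    Assumption: the node starts at a settled level (volts[0]); an excursion
    larger than amp_frac * vdd that returns within one clock period is a
    glitch, otherwise it is a transition.
    """
    threshold = amp_frac * vdd
    events = []
    base = volts[0]   # settled level the node started from
    start = None      # start time of the excursion being tracked
    peak = 0.0
    for t, v in zip(times, volts):
        dev = abs(v - base)
        if start is None:
            if dev > threshold:
                start, peak = t, dev
        else:
            peak = max(peak, dev)
            if dev <= threshold:
                # Node came back to its starting level: a glitch.
                events.append(Event("glitch", start, t, peak))
                start = None
            elif t - start >= t_clk:
                # Node stayed away for a full clock period: a transition.
                events.append(Event("transition", start, t, peak))
                base = vdd if v > vdd / 2 else 0.0
                start = None
    return events


def dynamic_power(event, f_clk, c_node, vdd):
    """Dynamic power of one event following Eq. (1): P = f_clk * C * Vdd * dV."""
    return f_clk * c_node * vdd * event.swing
```

Declaring a transition only after a full clock period has elapsed mirrors the rule stated in the text; a production tool would also need to handle excursions that are still open at the end of the waveform.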

2.3 Simulation Strategy 

To be able to make some predictions of the importance of glitching in the future,
the circuits have been simulated under two different conditions. In the first case,
we have lowered the supply voltage from 3.5 V down to 1.0 V without changing
the threshold voltage. In the second case, the threshold voltage is scaled
proportionally to the supply voltage. For example, at 3.3 V, the threshold voltage
is 0.38 V for the NMOS transistor, which is scaled to 0.17 V at Vdd = 1.5 V. At
a supply voltage of 1.0 V, the scaled Vt becomes 0.12 V, which is an unrealistic
value. The minimum usable Vt is approximately 0.2 V at room temperature. If
a lower Vt is used, the leakage becomes intolerable [5]. However, the two simulation
conditions (constant Vt and scaled Vt, respectively) can be used as limits
for predicting the future importance of glitches. The Vt scaling will certainly lie
somewhere within these limits, but it is hard to predict exactly where.
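
The proportional scaling used in the second scenario is simple enough to check by hand; a minimal Python sketch follows. The function name `scaled_vt` is hypothetical, and the reference point (Vt = 0.38 V at Vdd = 3.3 V) is the one quoted in the text.

```python
def scaled_vt(vt_ref, vdd_ref, vdd):
    """Threshold voltage scaled proportionally to the supply voltage."""
    return vt_ref * vdd / vdd_ref


# Reference point quoted in the text: NMOS Vt = 0.38 V at Vdd = 3.3 V.
for vdd in (3.3, 1.5, 1.0):
    vt = scaled_vt(0.38, 3.3, vdd)
    print(f"Vdd = {vdd:.1f} V -> scaled Vt = {vt:.2f} V")
```

Rounded to two decimals, this gives the 0.17 V and 0.12 V values quoted above for 1.5 V and 1.0 V.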

Two-hundred random test-vectors have been fed to the simulated circuits 
and the supply voltage has been decreased in steps of 0.1 V. The simulation and 
processing time for the 8 x 8-bit multiplier has been 10 days on a Sun Ultra 10, 
333 MHz. 

3 Discussion and Simulation Results 

The output data from the C program have been processed and plotted using
Matlab™. In Fig. 2, we have plotted the power consumption of glitches and
transitions for different supply voltages. We have also plotted the ratio between
the power consumption of glitches and transitions. In the left column we have
the RCA8 and the 4 × 4-bit multiplier and in the right column we have the
8 × 8-bit multiplier. The dotted lines are constant Vt, and the solid lines are scaled Vt.

As expected, the power consumption of transitions, plots (c) and (d), falls 
off with the square of the supply voltage. The glitches, on the other hand, which 
are in plots (a) and (b), show a somewhat different behavior. 

In plots (e) and (f), we have plotted the relative power consumption of
glitches compared with the total dynamic power consumption. We can see that
approximately 40% of the power consumption stems from glitches in the 8 × 8-bit
multiplier. For the RCA8 and the 4 × 4-bit multiplier, the figures are 15% and
10% respectively. In the multipliers, the power consumption of glitches goes up
for lower supply voltages; it can be hard to spot in the plot though. In the RCA8,
on the other hand, the relative power consumption of glitches goes down for low
supply voltages. This is of course in the case where a constant Vt is used. If
Vt is scaled proportionally, the glitch power consumption falls off at exactly the
same rate as the power consumption of transitions.

To understand why the glitches behave differently in the two scenarios
and also between different structures, we have plotted
the voltage swing distribution of glitches at different supply voltages in Fig. 3.
On the x-axis, we have the voltage swing relative to Vdd and on the y-axis we
have the supply voltage. The number of glitches is plotted in the z-direction. The
voltage swing has been divided into 0.05 wide bins to improve the readability.

The first two things that one observes are that almost all glitches have full
swing and that there are no glitches with an amplitude lower than 10% of Vdd.
The low-swing glitches are missing because of the glitch amplitude value set in
the glitch-detection program. If the glitch amplitude value were changed to a lower
value we would get a similar peak as for full-swing glitches. The drawback with
a lower detection threshold is that we might fail to detect glitches whose peak is
close to one of the supply rails.



Fig. 2. Dynamic power consumption of transitions and glitches

In circuits where there are many short signal paths and few longer ones, the 
vast majority of glitches are generated in the gates from which they are output, 
whereas very few glitches are the result of mere propagation. There are simply 
very few paths to propagate glitches through. If the logic depth, or rather the 
number of long paths, is larger, the propagated glitches will consequently be 
much more common. In circuits with large gate fan-outs the number of glitches 
that are propagated may even increase exponentially. At some point, for a certain 
size and a certain structure of the circuit, the propagated glitches may very 
well dominate the total number of glitches. For the circuits considered in this 
paper, we have the 4x4-bit multiplier, where the ratio between propagated and 
generated glitches is larger than in the RCA8 circuit, and the 8 x 8-bit multiplier, 
where the ratio has grown even larger. 

A generated glitch is a function of the difference in arrival times of the input
signals to the gate producing the glitch. A propagated glitch, on the other hand,
is a function of the gate transfer characteristics. Let us now consider the CMOS
gate transfer function: any such transfer function tends to make glitches with
low swing even smaller and large-swing glitches even larger. However, the sharper
the knee of the function, the more pronounced this effect is. As shown in Fig. 4,
reducing Vdd, while keeping Vt constant, makes the transfer function sharper.
Reducing Vdd, while proportionally reducing Vt, will obviously keep the shape
of the transfer function intact.

The implication of the ratio between the number of propagated and generated
glitches is that the bigger this ratio is for a circuit, the more the circuit
depends on the transfer characteristics of the gates. There are, relatively
speaking, very few glitches having medium-range swing inside circuits which
contain a large number of long paths or where the average gate fan-out is fairly
large. Consequently, for the 8 × 8-bit multiplier, where there are many long paths,
the propagated glitches dominate. Thus, this circuit depends heavily on
the transfer functions of the gates, i.e. we have relatively few medium-range
glitches in this circuit. By the same line of reasoning, the RCA8 circuit will
have fairly many glitches with medium-range swing, at least in comparison to
the multipliers. This is clearly illustrated in Fig. 3.

Also, the dependence of the CMOS gate transfer function on Vdd and Vt
can be observed in Fig. 3. In all three graphs showing Vdd reduction at constant
Vt, the number of medium-range glitches decreases with reduced Vdd. This
is due to the fact that the slope of the transfer function gets steeper with
reduced Vdd. By the same line of reasoning, for the simulations using Vdd
reduction with scaled Vt, there is no redistribution of glitches in terms of voltage
swing, since the transfer function stays constant in shape.



Fig. 4. Voltage-transfer functions of an inverter at Vdd = 1.0 V and Vdd = 3.3 V,
constant Vt = 0.38 V

The next thing we can observe is that if Vt is kept constant, the number
of full-swing glitches increases when Vdd is lowered. To explain this, we use
Sakurai's alpha-power model [7]. From the model we get the following expressions

$$t_p = \left(\frac{1}{2} - \frac{1 - V_T/V_{DD}}{1 + \alpha}\right) t_T + \frac{C_L V_{DD}}{2 I_{D0}} \qquad (2)$$

$$t_T = \frac{C_L V_{DD}}{I_{D0}} \left(\frac{0.9}{0.8} + \frac{V_{D0}}{0.8\,V_{DD}} \ln\frac{10\,V_{D0}}{e\,V_{DD}}\right) \qquad (3)$$

$$V_{D0} = V_{D0,ref} \left(\frac{V_{DD} - V_T}{V_{DD,ref} - V_T}\right)^{\alpha/2} \qquad (4)$$

where t_p and t_T are the propagation and transition times respectively, and Eq. 4
is used to recalculate the drain-saturation voltage to a different Vdd. If Eq. 2
and Eq. 3 are combined, we get the following expression for the propagation time

$$t_p = \frac{C_L V_{DD}}{I_{D0}} \left[\left(\frac{1}{2} - \frac{1 - V_T/V_{DD}}{1 + \alpha}\right) k_1 + \frac{1}{2}\right], \qquad k_1 = \frac{0.9}{0.8} + \frac{V_{D0}}{0.8\,V_{DD}} \ln\frac{10\,V_{D0}}{e\,V_{DD}} \qquad (5)$$



Fig. 5. Voltage-swing distribution of glitches at different supply voltages



Now, let us assume that we have a gate, e.g. an inverter, loaded by a capacitance,
C_L, and also make the assumption that the capacitance is discharged by
the constant current, I. We have the following expression for the voltage swing
of the output node

$$I = C_L \frac{\Delta V}{\Delta t} \;\Rightarrow\; \Delta V = \frac{I\,\Delta t}{C_L} \qquad (6)$$

where Δt is the duration time of the input signal which causes the discharge
of the node. If a glitch appears at the input of the gate, it must be due to
differences in propagation delay and thus, we can model its duration time as
k_2 · t_p. If we assume that we have a typical short-channel device, i.e. α ≈ 1,
and that expression k_1 in Eq. 5 is constant, i.e. independent of Vdd, we get the
following expression for the voltage swing



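$$\Delta V = \frac{I_{D0}\,k_2}{C_L} \cdot \frac{C_L V_{DD}}{2 I_{D0}} \left(k_1 \frac{V_T}{V_{DD}} + 1\right) = \frac{k_2}{2}\,(k_1 V_T + V_{DD}) \qquad (7)$$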



If we now calculate the relative voltage swing with constant Vt at two different
supply voltages, V_DD1 = V_DD and V_DD2 = V_DD/S, where S is the scaling factor,
we get the following results

$$\frac{\Delta V_1}{V_{DD1}} = \frac{k_2}{2}\,\frac{k_1 V_T + V_{DD1}}{V_{DD1}} = \frac{k_2}{2}\,\frac{k_1 V_T + V_{DD}}{V_{DD}} \qquad (8)$$

$$\frac{\Delta V_2}{V_{DD2}} = \frac{k_2}{2}\,\frac{k_1 V_T + V_{DD2}}{V_{DD2}} = \frac{k_2}{2}\,\frac{k_1 V_T + V_{DD}/S}{V_{DD}/S} = \frac{k_2}{2}\,\frac{k_1 V_T S + V_{DD}}{V_{DD}} \qquad (9)$$

Eqs. 8 and 9 show that if the supply voltage is scaled down and the threshold
voltage is kept constant, the relative swing of the node increases. A recalculation
of Eq. 9 with a proportionally scaled threshold voltage, V_T2 = V_T/S, gives

$$\frac{\Delta V_2}{V_{DD2}} = \frac{k_2}{2}\,\frac{k_1 V_{T2} + V_{DD2}}{V_{DD2}} = \frac{k_2}{2}\,\frac{k_1 V_T + V_{DD}}{V_{DD}} \qquad (10)$$

which shows that the relative swing of the node is constant if Vt is scaled
proportionally. Despite all the rough approximations, both these statements agree
with the results from the simulations in Fig. 3. A magnified plot of the glitch
distribution of the 4 × 4-bit multiplier is shown in Fig. 5.
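
The two statements drawn from Eqs. 8 to 10 can be checked numerically. The sketch below assumes the relative swing has the form (k_2/2)(k_1·V_T + V_DD)/V_DD, which is how the derivation above reads; the function name `relative_swing` and the values chosen for k_1 and k_2 are purely illustrative assumptions, not figures from the paper.

```python
def relative_swing(vdd, vt, k1, k2):
    """Relative glitch swing dV / Vdd = (k2 / 2) * (k1 * Vt + Vdd) / Vdd."""
    return 0.5 * k2 * (k1 * vt + vdd) / vdd


K1, K2 = 1.2, 0.5   # illustrative constants (assumed, not from the paper)
S = 2.0             # supply scaling factor

ref = relative_swing(3.3, 0.38, K1, K2)
const_vt = relative_swing(3.3 / S, 0.38, K1, K2)       # Vt kept constant
scaled_vt = relative_swing(3.3 / S, 0.38 / S, K1, K2)  # Vt scaled with Vdd

print(f"reference:   {ref:.4f}")
print(f"constant Vt: {const_vt:.4f}")
print(f"scaled Vt:   {scaled_vt:.4f}")
```

With a constant threshold the relative swing grows as the supply is scaled down, while with proportional scaling it is unchanged, whatever positive values are picked for k_1 and k_2.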



3.1 Verification Using 0.13 µm Process Parameters

To evaluate the simulation results, the circuits have also been simulated using the
0.13 µm parameters from the Device Research Group at Berkeley [8]. The supply
voltage was 1.5 V and the NMOS threshold voltage 0.24 V, giving a Vt/Vdd ratio
of 0.16, which is between a constant Vt, ratio = 0.25, and a proportionally scaled
Vt, ratio = 0.12, of the 0.35 µm process. We use the number of full-swing glitches
as a measuring device in the evaluation. From the simulations we get the results
in Tab. 1. Since the values of the 0.13 µm process are in between our predicted
limits, they do not contradict our results.



Table 1. Number of full-swing glitches

Circuit    Const. Vt, 0.35 µm   Sim., 0.13 µm   Scaled Vt, 0.35 µm   Pred., 0.13 µm   Error
RCA8       122                  94              86                   97               -3.1%
Mult4x4    167                  142             139                  147              -3.5%
Mult8x8    4543                 3542            3435                 3767             -6.0%



4 Conclusion 

The power consumption and voltage-swing distribution of glitches have been
studied for two types of circuits: the ripple-carry adder and the array multiplier.
The main reason for the study was to see if the importance of glitches will
increase or decrease in future processes when the supply voltage is scaled down
further.

Two different scenarios have been considered, one with constant Vt and
one with Vt scaled proportionally to Vdd. Neither of these two scenarios will
predict the future Vt scaling, but the truth will certainly lie within the limits of
the simulations of the two.

When Vt is scaled proportionally to the supply voltage, the relative power
consumption of glitches stays almost constant. Furthermore, the voltage-swing
distribution remains the same during Vdd scaling. That is, if Vt is scaled
proportionally, the conditions for glitches will be the same in the future as they are
today. However, as mentioned earlier, such Vt scaling is impossible due to leakage.

In the other scenario, where Vt is kept constant, the relative power consump- 
tion of glitches increases by some percent for the multipliers and decreases by 
some percent for the ripple-carry adder. The voltage-swing distribution changes 
in this scenario. The number of full-swing glitches increases when the supply 
voltage is lowered. This is the cause of the small relative increase in glitch power 
for the multipliers. 

Under the assumption that the leakage power can be kept at a reasonable
level in future processes, the overall conclusion drawn from this study is that
the power consumption of glitches will at least be at the same relative level as
today.



Abstract. This contribution extends the Degradation Delay Model (DDM),
previously developed for CMOS inverters, to simple logic gates. A gate-level
approach is followed. At a first stage, all input collisions producing degradation
are studied and classified. Then, an exhaustive model is proposed, which defines
a set of parameters for each particular collision. This way, a full and accurate
description of the degradation effect is obtained (compared to HSPICE) at the
cost of storing a rather high number of parameters. To solve that, a simplified
model is also proposed, maintaining similar accuracy but with a reduced number
of parameters and a simplified characterization process. Finally, the complexity
of both models is compared.



1 Introduction 

As digital circuits become larger and faster, better analysis tools are required. It means 
that logic simulators must be able to handle bigger circuitry in a more and more accu- 
rate way. Simulating larger circuits is aided by the evolution of computer systems 
capabilities, and accuracy is improved by providing more realistic delay models. 

Currently, there exist accurate delay models which take account of most modern
issues [1, 2, 3, 4]: low voltage operation, sub-micron and deep sub-micron devices,
transition waveform, etc. Besides these effects there are also dynamic situations
which should be handled by the delay model. The most important dynamic effects are
the so-called input collisions [5]: the behavior of a gate when two or more input
transitions happen close in time may be quite different from the response to an
isolated input transition. Of all these input collisions, there is a special interest
in the glitch collisions, which are those that may cause an output glitch. Being able
to handle these glitch collisions is important since they are more and more likely to
happen in current fast circuits, and will help us to determine race conditions and the
true power consumption due to glitches [6, 7]. This is also strongly related to the
modeling of the inertial effect [8], which determines when a glitch is filtered, and to
the triggering of metastable behavior in latches [9, 10, 11, 12]. Other authors have
treated the problem of glitches, either partially or not very accurately [5, 6, 7, 13].



D. Soudris, P. Pirsch, and E. Barke (Eds.): PATMOS 2000, LNCS 1918, pp. 149-158, 2000. 
© Springer-Verlag Berlin Heidelberg 2000 







J. Juan-Chico et al. 





Fig. 1. Quantification of delay degradation: (a) degradation due to a narrow pulse,
(b) degradation due to a glitch collision.

In previous work [14, 15] we have studied the problem from a more general point
of view, called the Delay Degradation Effect, showing its importance and proposing a
very accurate model for the CMOS inverter. The model obtained is called the Degradation
Delay Model (DDM).

In the present paper we extend the model to simple gates (<N>AND, <N>OR) from
the viewpoint of gate-level modeling, looking for an external characterization suited
to standard cell characterization. In Sect. 2 we summarize the basic aspects of the
DDM. Then we make the extension to gates, studying the types of glitch collisions
and defining an exhaustive model for degradation at the gate level in Sect. 3. From the
characterization results in Sect. 4, we derive a simplified model, whose
accuracy and complexity are compared to those of the exhaustive one. Finally, we
derive some conclusions.

2 Degradation Delay Model (DDM) 

The degradation effect consists in the reduction of the propagation delay of an input
transition to a gate, when this input transition takes place close in time to a previous
input transition. This effect includes the propagation of narrow pulses and fast pulse
trains, and the delay produced by glitch collisions. This reduction in the delay can be
expressed with an attenuating factor applied to the normal propagation delay, t_p0,
which is the delay for a single, isolated transition without taking account of the
degradation effect:

$$t_p = t_{p0} \left(1 - e^{-(T - T_0)/\tau}\right) \qquad (1)$$

where T is the time elapsed since the last output transition, and determines how much
degradation applies to the current transition, and T_0 and τ are the degradation
parameters, which are determined by fitting to electrical simulation data. For a given
input transition, degradation will depend on the value of T, which expresses the internal
state of the gate when the transition arrives, caused by previous transitions (Fig. 1).
Parameters t_p0, T_0 and τ, in turn, depend on multiple factors: input transition time
(τ_in), output load (C_L), supply voltage (V_DD) and the gate's geometry (W_N and W_P).
For the normal propagation delay, t_p0, good models can be found in the literature [2]
and any of them can be used here. In [14] we obtained expressions for T_0 and τ as a
function of these parameters:



$$\tau_x V_{DD} = a_x + b_x \frac{C_L}{W_y}, \qquad T_{0x} = \left(\frac{1}{2} - c_x \frac{V_{Ty}}{V_{DD}}\right) \tau_{in} \qquad (2)$$

where the pair (x, y) is (f, N) or (r, P) to distinguish falling from rising output
transitions respectively. V_TN and V_TP are the MOS transistor thresholds. The parameters
a, b and c are obtained in order to fit simulation data and characterize the process.
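
As an illustration of how Eqs. (1) and (2) fit together, the following Python sketch evaluates the degraded delay for given parameters. It is only a sketch under stated assumptions: the function names are hypothetical, Eq. (1) is taken as t_p = t_p0·(1 - exp(-(T - T_0)/τ)), and returning a zero delay for T ≤ T_0 (the transition being treated as filtered) is an assumption of this example rather than something stated in the text.

```python
import math


def degradation_params(a, b, c, c_load, width, v_t, vdd, tau_in):
    """Degradation parameters (tau, T0) following the form of Eq. (2)."""
    tau = (a + b * c_load / width) / vdd
    t0 = (0.5 - c * v_t / vdd) * tau_in
    return tau, t0


def degraded_delay(t_p0, elapsed, t0, tau):
    """Propagation delay with degradation, following the form of Eq. (1).

    elapsed is T, the time since the last output transition.
    Assumption: for T <= T0 the delay is clipped to zero.
    """
    if elapsed <= t0:
        return 0.0
    return t_p0 * (1.0 - math.exp(-(elapsed - t0) / tau))


# Arbitrary illustrative numbers in consistent (unspecified) units.
tau, t0 = degradation_params(a=1.0, b=2.0, c=0.5, c_load=3.0,
                             width=2.0, v_t=0.5, vdd=2.0, tau_in=4.0)
for elapsed in (2.0, 4.0, 8.0, 50.0):
    print(elapsed, degraded_delay(100.0, elapsed, t0, tau))
```

For large T the exponential vanishes and the delay tends to the normal value t_p0, which is the expected behavior for an isolated transition.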



3 Degradation Delay Model at the Gate Level 

In this section we will extend the DDM to simple gates (<N>AND, <N>OR) by
performing three steps:

1. Reformulating (2) at the gate level, when no information about the gate's internal
structure is available. Gate-level degradation parameters are defined in this step.

2. Finding out which distinct cases may lead to delay degradation. These are the glitch
collisions or degraded collisions.

3. Defining a set of parameters for each glitch collision.

Due to point 3, the model defined this way may contain many parameters, with a
particular set for each glitch collision case. Thus, this model will be referred to as
the gate-level exhaustive model for delay degradation. The purpose of this model is to
be able to reproduce the propagation of each glitch collision with maximum accuracy.

3.1 DDM Reformulation at the Gate Level 



To rewrite (2) we merge into single new gate-level parameters the old ones and those
internal parameters that are not visible at the gate level. In other words, $a_x$ becomes A,
$b_x/W_y$ becomes B and $c_x V_{Ty}$ becomes C. This way, (2) is rewritten as

$$ \tau V_{DD} = A + B\,C_L, \qquad T_0 = \left(\frac{1}{2} - \frac{C}{V_{DD}}\right)\tau_{in} \qquad (3) $$

A determines the value of $\tau$ when $C_L = 0$, and is strongly related to the gate's internal
output capacitance; B depends on the geometry (or equivalent geometry) of the gate
and C is related to some "effective" gate threshold. A single value of A, B and C will
be calculated for each glitch collision.

Table 1. Glitch collision characteristics for NOR and NAND gates. "i" is the index of the
input changing alone or in second place; "j" is the index of the input changing in first place.

  Type of      Input evolution               Final output transition
  collision    NOR           NAND            NOR            NAND
  ---------    ----------    ----------      -----------    -----------
  Type 1       i: 0-1-0      i: 1-0-1        rising (r)     falling (f)
               rest: 0       rest: 1
  Type 2       j: 1-0        j: 0-1          falling (f)    rising (r)
               i: 0-1        i: 1-0
               rest: 0       rest: 1

3.2 Glitch Collisions 

In a simple gate we can distinguish two types of glitch collisions, depending on how
and to which values the inputs change. To be able to talk in a general sense we will call S
the sensitizing logic value, that is, the logic value of the inputs which makes the output of
the gate sensitive to other inputs. It is "0" for (N)OR gates and "1" for (N)AND gates.

The opposite value will be denoted $\bar{S}$ (non-sensitizing logic value).

When in a simple gate all inputs are equal to S, the output value is S for non-inverting
gates and $\bar{S}$ for inverting gates. For any other input vector, the value of the
output is the opposite. In the following we will consider inverting gates since a similar
discussion can be applied to the non-inverting case. Using this, two types of glitch
collisions can be defined:

• Type 1: Initially, all inputs have value S and the output is $\bar{S}$. The output may
change if any input changes, and a glitch may occur only if the same input changes again to
value S. This type corresponds to a positive pulse in one input of a NOR gate or a negative
pulse in one input of a NAND gate. Only one input is involved in this type of
glitch collision, so n possible collisions of type 1 exist for an n-input simple
gate.

• Type 2: In this case, every input except one (the j-th) has value S and the output is
also S. The output may change only if input j changes to S, and an output glitch
may occur if any input (the i-th) then changes to $\bar{S}$. This way, any input pair (even if
i = j) may produce a glitch collision of type 2, resulting in $n^2$ possibilities.

We use collision-i to refer to type-1 collisions with the i-th input changing, and collision-ij
to refer to a type-2 collision with the i-th input changing after the j-th input. In Table 1 we
have summarized the properties of both types of collisions for NOR and NAND gates.

Table 2. Vector/matrix form of the gate-level degradation parameters for an INVERTER and
two-input NOR and NAND gates.



Type of gate Parameter A Parameter B Parameter C




3.3 Exhaustive Model for Gate-Level Delay Degradation 

The total number of collisions for an n-input gate, including type-1 and type-2, is

$$ n + n^2 = n(n + 1). \qquad (4) $$
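
The count n + n^2 = n(n + 1) can be checked by enumerating the collisions explicitly. The short Python sketch below does so; the tuple encoding of a collision is an illustrative choice and not part of the model.

```python
from itertools import product


def glitch_collisions(n):
    """List all glitch collisions of an n-input simple gate.

    Type 1: ("type1", i), input i pulses alone.
    Type 2: ("type2", i, j), input i changes after input j (i may equal j).
    """
    inputs = range(1, n + 1)
    type1 = [("type1", i) for i in inputs]
    type2 = [("type2", i, j) for i, j in product(inputs, repeat=2)]
    return type1 + type2


for n in range(1, 6):
    collisions = glitch_collisions(n)
    assert len(collisions) == n * (n + 1)
    print(n, len(collisions))
```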

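Any such collision may be studied like an inverter under a narrow pulse input.
Equations (1) and (3) can be applied to each case and a particular set of (A, B, C)
parameters obtained for each collision. In this sense, if we let $\Delta$ represent any of
$\tau$, $T_0$, A, B or C, we can refer to any single value with a notation like this
(x is f or r, as in (2)):

• $\Delta_{xi}$: value of parameter $\Delta$ for collision-i.

• $\Delta_{xij}$: value of parameter $\Delta$ for collision-ij.

These parameters can be expressed in vector/matrix notation like this:

$$ \bar{\Delta}_x = \begin{pmatrix} \Delta_{x1} \\ \vdots \\ \Delta_{xn} \end{pmatrix}, \qquad \bar{\bar{\Delta}}_x = \begin{pmatrix} \Delta_{x11} & \cdots & \Delta_{x1n} \\ \vdots & \ddots & \vdots \\ \Delta_{xn1} & \cdots & \Delta_{xnn} \end{pmatrix} \qquad (5) $$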

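In Table 2 we show the vector/matrix form of parameters A, B and C for gates NOR2,
NAND2 and INVERTER. Using (5), the expressions in (3) can also be written in
vector/matrix form:

$$ \bar{\tau}_x V_{DD} = \bar{A}_x + \bar{B}_x C_L, \qquad \bar{T}_{0x} = \left(\frac{1}{2} U_n - \frac{1}{V_{DD}} \bar{C}_x\right)\tau_{in} $$
$$ \bar{\bar{\tau}}_x V_{DD} = \bar{\bar{A}}_x + \bar{\bar{B}}_x C_L, \qquad \bar{\bar{T}}_{0x} = \left(\frac{1}{2} U_{nn} - \frac{1}{V_{DD}} \bar{\bar{C}}_x\right)\tau_{in} \qquad (6) $$

where $U_n$ and $U_{nn}$ are the n-dimensional all-1's vector and matrix respectively.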



4 Results 

To obtain the whole set of parameters for a gate we use a characterization process which
consists of two tasks:

1. Obtain $t_p$ vs. T curves (see eq. 1) using an electrical simulator like HSPICE. For
each curve, a value of $\tau$ and $T_0$ is obtained by fitting the simulation data to (1).

2. Task 1 is done repeatedly using different values of $C_L$ and $\tau_{in}$. The resulting
$\tau$ and $T_0$ data are fitted to (3) and a value of A, B and C is obtained.
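
The two fitting tasks can be sketched as follows. This is a minimal illustration on synthetic data, assuming the model forms $t_p = t_{p0}(1 - e^{-(T - T_0)/\tau})$ and $\tau V_{DD} = A + B\,C_L$. The linearised least-squares fit is a simplification chosen for this sketch, not necessarily the fitting procedure used by the characterization tool, and all names and values are hypothetical.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = m * x + q; returns (m, q)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def fit_tau_t0(elapsed, delays, t_p0):
    """Task 1: fit (tau, T0) to a t_p vs. T curve.

    Uses the linearisation ln(1 - t_p / t_p0) = -T / tau + T0 / tau.
    """
    ys = [math.log(1.0 - tp / t_p0) for tp in delays]
    m, q = linear_fit(elapsed, ys)
    tau = -1.0 / m
    return tau, q * tau


def fit_a_b(loads, taus, v_dd):
    """Task 2 (tau part): fit tau * V_DD = A + B * C_L; returns (A, B)."""
    b, a = linear_fit(loads, [tau * v_dd for tau in taus])
    return a, b


# Synthetic "simulation" data generated from known parameters.
t_p0, true_tau, true_t0 = 100.0, 50.0, 60.0
T = [80.0, 100.0, 150.0, 200.0, 300.0]
tp = [t_p0 * (1.0 - math.exp(-(t - true_t0) / true_tau)) for t in T]
print(fit_tau_t0(T, tp, t_p0))

loads = [1.0, 2.0, 3.0, 4.0]
taus = [(100.0 + 10.0 * c) / 2.5 for c in loads]
print(fit_a_b(loads, taus, v_dd=2.5))
```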

The two tasks are carried out for each glitch collision. The whole process needed to
fully characterize a gate is quite complex. For example, the exhaustive characterization
of a NAND4 gate requires performing about 8000 transient analyses. To make such
complexity affordable, we have developed an automatic characterization tool which
handles the whole characterization process, from launching the electrical simulator
which performs the transient analyses to carrying out the curve fitting tasks. Using this
tool, it is quite straightforward to study a wide set of gates.

Qualitatively, the results obtained for all gates analyzed are quite similar in the
sense that simulation data can be easily fitted to (1) and (3), validating the degradation
model. An example can be seen in Fig. 2. Gates ranging from 1 to 4 inputs have been
analyzed. As an example, we present the results for a NAND4 and a NOR4 gate in
Table 3. NAND4 data is also shown in graphical form in Fig. 3, and serves as an example
since all gates give quite similar qualitative results.

5 Simplified Model 

It can be easily observed in Fig. 3 how A, B and C are almost independent of the first
changing input (j) in type-2 collisions. This means that, in practice, the degradation effect
does not depend on which input triggered the last output transition, only on when that
output transition took place. In other words, it depends on the state of the gate, but not
on which input put the gate in that state. As a result, the degradation parameters for
collision-ij, $\Delta_{xij}$, are very similar for different values of j.

Based on this result we propose a simplified degradation model for gates, in which
we consider a single value of the parameter regardless of the value of j. This means
substituting each row in the matrices of Table 3 with a single value. This single value is a
particular one taken from each row ($\Delta_{xik}$) and is denoted $\tilde{\Delta}_{xi}$. That is,

$$ \Delta_{xij} \approx \Delta_{xik} \equiv \tilde{\Delta}_{xi} \quad \forall (i, j). \qquad (7) $$

Any value of k with $1 \le k \le n$ is possible. Our criterion is to take an intermediate
value of the form

$$ k = \lceil n/2 \rceil. \qquad (8) $$

This way, each matrix in Table 3 is reduced to a single column, which can be written
as a vector. The resulting simplified sets of parameters for the NOR4 and NAND4
gates of the previous example are shown in Table 4. The number of glitch collisions
that we need to take into account is reduced to 2n.

Fig. 2. Example of simulation data fitting to the degradation model: a) $t_p$ vs. $T$,
b) $\tau$ vs. $C_L$, c) $T_0$ vs. $\tau_{in}$.
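
The reduction from the exhaustive to the simplified model can be illustrated with the type-2 parameter A of the NOR4 gate from Table 3 (rows indexed by i, columns by j). The sketch below assumes the intermediate column is k = ceil(n/2), which gives k = 2 for n = 4 and reproduces the corresponding NOR4 values listed in Table 4; the function names are illustrative.

```python
import math

# Type-2 parameter A for the NOR4 gate (Table 3): row i, column j.
NOR4_A_TYPE2 = [
    [788.806, 804.331, 780.062, 786.426],
    [824.225, 824.258, 823.485, 824.397],
    [860.778, 847.25, 852.561, 850.086],
    [875.267, 876.37, 881.897, 878.463],
]


def simplify(matrix):
    """Keep one value per row: the entry in column k = ceil(n / 2)."""
    n = len(matrix)
    k = math.ceil(n / 2)  # 1-based column index, assumed form of k
    return [row[k - 1] for row in matrix]


def max_relative_spread(matrix):
    """Largest (max - min) / min over the rows: how much j matters."""
    return max((max(row) - min(row)) / min(row) for row in matrix)


print(simplify(NOR4_A_TYPE2))
print(max_relative_spread(NOR4_A_TYPE2))
```

For this matrix the values within a row differ by about 3% at most, which is why a single representative per row loses little accuracy.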

The values of the parameters for different j are so similar that the simplified model is
almost as accurate as the exhaustive model, but the number of parameters is greatly
reduced, as well as the complexity of the characterization process. In Table 5 we compare the
number of parameters and the characterization complexity (measured as the number of
transient analyses) for both models, applied to gates with up to five inputs. The benefits
of the simplified model are clear, especially when increasing the number of inputs.

Table 3. Vector/matrix form of the gate-level degradation parameters for four-input NOR and
NAND gates. Vectors (type-1 collisions) list the values for i = 1..4; matrices (type-2
collisions) have one row per i and one column per j.

  NOR4
    A_r    112.819   145.08    275.101   568.706

    A_f    788.806   804.331   780.062   786.426
           824.225   824.258   823.485   824.397
           860.778   847.25    852.561   850.086
           875.267   876.37    881.897   878.463

    B_r    2.71788   2.62542   2.41312   1.83907

    B_f    7.32507   7.21159   7.30652   7.29638
           7.43454   7.45502   7.44032   7.42662
           7.49901   7.5641    7.52869   7.54409
           7.60508   7.60983   7.58054   7.61039

    C_r    1.56364   1.47036   1.39764   1.29989

    C_f    1.80267   1.76748   1.69145   1.67959
           2.14557   2.09964   2.05788   2.02964
           2.42609   2.37594   2.3378    2.31878
           2.74211   2.70625   2.67864   2.68137

  NAND4
    A_f    341.335   363.03    432.19    533.097

    A_r    364.451   356.81    359.536   357.584
           374.961   364.568   365.183   365.746
           395.57    391.429   390.884   388.101
           436.244   432.208   421.57    416.158

    B_f    15.2991   15.4685   15.3365   14.7835

    B_r    14.7053   14.5088   14.4525   14.5096
           15.2026   15.4239   15.4003   15.4015
           15.6956   15.7685   15.7861   15.833
           16.3134   16.2464   16.3738   16.4578

    C_f    1.49791   1.39779   1.27071   1.04927

    C_r    1.97685   1.89809   1.8573    1.84559
           2.49992   2.43175   2.40956   2.39455
           2.90296   2.90767   2.752     2.74911
           3.2206    3.20356   3.1773    3.15793

Table 4. Vector form of the simplified gate-level degradation parameters for four-input NOR
and NAND gates.

  Gate    Parameter   Collision   i=1       i=2       i=3       i=4
  -----   ---------   ---------   -------   -------   -------   -------
  NOR4    A_r         type 1      112.819   145.08    275.101   568.706
  NOR4    A_f         type 2      804.331   824.258   847.25    876.37
  NOR4    B_r         type 1      2.71788   2.62542   2.41312   1.83907
  NOR4    B_f         type 2      7.21159   7.45502   7.5641    7.60983
  NOR4    C_r         type 1      1.56364   1.47036   1.39764   1.29989
  NOR4    C_f         type 2      1.76748   2.09964   2.37594   2.70625
  NAND4   A_f         type 1      341.335   363.03    432.19    533.097
  NAND4   A_r         type 2      356.81    364.568   391.429   432.208
  NAND4   B_f         type 1      15.2991   15.4685   15.3365   14.7835
  NAND4   B_r         type 2      14.5088   15.4239   15.7685   16.2464
  NAND4   C_f         type 1      1.49791   1.39779   1.27071   1.04927
  NAND4   C_r         type 2      1.89809   2.43175   2.90767   3.20356

Fig. 3. Graphical representation of the gate-level degradation parameters for a NAND4 gate.
i is the changing input in type-1 collisions; j and i are the first and second changing inputs
respectively in type-2 collisions. The graphs show the variation of the degradation
parameters with the number of the input(s) changing.

Table 5. Comparison of the exhaustive and the simplified model in terms of number of
parameters and characterization complexity. If $n_c$ is the number of glitch collisions, the
number of parameters is $3 n_c$ and the number of transient analyses is estimated as
$400 n_c$. $n_c$ is $n(n + 1)$ for the exhaustive model and $2n$ for the simplified model.

       no. of parameters          no. of transient analyses
  n    exhaustive   simplified    exhaustive   simplified
  -    ----------   ----------    ----------   ----------
  1    6            6             800          800
  2    18           12            2400         1600
  3    36           18            4800         2400
  4    60           24            8000         3200
  5    90           30            12000        4000
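
The figures in Table 5 follow directly from the collision counts. The sketch below recomputes them, taking 3 parameters (A, B, C) and an estimated 400 transient analyses per glitch collision as stated in the table caption; the function name is illustrative.

```python
def model_cost(n, simplified=False, analyses_per_collision=400):
    """Return (number of parameters, number of transient analyses)
    for an n-input gate."""
    collisions = 2 * n if simplified else n * (n + 1)
    return 3 * collisions, analyses_per_collision * collisions


for n in range(1, 6):
    print(n, model_cost(n), model_cost(n, simplified=True))
```

The exhaustive cost grows quadratically with the number of inputs while the simplified cost grows linearly.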
6 Conclusions

A way to extend the degradation delay model to the gate level has been presented.
Those input collisions that may cause the degradation effect (glitch collisions) have been
analyzed and classified. Two models are presented: an exhaustive one which assigns a
set of degradation parameters to each glitch collision, and a simplified one which
associates a set of parameters with each input, instead of with each collision. The simplified
model has similar accuracy but reduces both the number of parameters and the complexity of
the characterization process. This model allows the accurate simulation of the degradation
effect at the gate level. An experimental simulator which implements this model is
currently under development.
Abstract. The performance characterization and optimization of logic circuits
under rapid process migration is one of the big challenges of today's
submicron CMOS technologies. This characterization must be robust over a wide
design space in predicting the performance evolution of designs. In this paper
we present a second generation of analytical modeling of delay performance,
considering the non-linear variation of delay induced by carrier speed
desaturation, I/O coupling, load and input ramp effects. A first model is deduced for
inverters and then extended to logic gates through a reduction protocol of the
serial transistor array. Validations are given, on a 0.18 µm process, by
comparing values of simulated (HSPICE) and calculated delay for different
configurations of inverters and gates.



1 Introduction 

The design complexity afforded by current submicron processes requires raising the
level of circuit abstraction in order to manage this complexity. But the need for accuracy
requires that accurate physical-level information on the performance of the structures
used in the design be available at the highest level of abstraction.

Accurate circuit timing characterizations must be available at all abstraction
levels. Taking the external operating conditions into account, they must also be able to
predict the evolution of circuit performance during process migration, voltage scaling or
any alternative used for design optimization.

Speeding up the design time implies using logic cells or macro cells with well
characterized performance. Standard look-up tables with linear interpolation are too
time consuming and no longer sufficient to model the delay performance of today's
designs implemented in submicron processes.

Accurate modeling of this performance requires reliable data on the switching time of
the structures together with their transition time. An accurate prediction of
these data must be obtained when varying the structure or its operating conditions,
such as the load, the controlling input slew or the supply voltage.

Different methods have been proposed to model the delay at gate level. In the
empirical method [1] the delay is represented as a polynomial expression with
parameters calibrated from electrical simulations. This results in an empirical
representation that carries no design information allowing design performance
prediction or optimization. A complex modeling of the output waveform can also be
used to obtain a good evaluation of the delay variation [2].
The look-up table method constitutes a discrete approach to the representation of delay
performance, in which switching delay and output slope are listed versus the load
and the slew of the input control. The final value is obtained by interpolation
between the tabulated ones. These tables, filled from electrical simulations, suffer from the
uncertainty in defining the extent of the axes of the table to be characterized.
Moreover, they are of no help to the designer in defining optimization criteria or in
clearly showing reasonable limits to be considered for load and slew.
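
For reference, the look-up table approach can be sketched as a bilinear interpolation over a small load/slew grid. The table values, axis points and function names below are purely hypothetical; the sketch is only meant to show the mechanics, including the fact that queries outside the characterized axes are extrapolated from the outermost cells, which is where the uncertainty about the table extent shows up.

```python
from bisect import bisect_right


def _bracket(axis, x):
    """Index i of the cell [axis[i], axis[i + 1]] used for x,
    clamped to the first or last cell of the table."""
    i = bisect_right(axis, x) - 1
    return max(0, min(i, len(axis) - 2))


def lut_delay(loads, slews, table, load, slew):
    """Bilinear interpolation; table[i][j] is the delay at
    (loads[i], slews[j])."""
    i = _bracket(loads, load)
    j = _bracket(slews, slew)
    u = (load - loads[i]) / (loads[i + 1] - loads[i])
    v = (slew - slews[j]) / (slews[j + 1] - slews[j])
    return ((1 - u) * (1 - v) * table[i][j]
            + u * (1 - v) * table[i + 1][j]
            + (1 - u) * v * table[i][j + 1]
            + u * v * table[i + 1][j + 1])


# Hypothetical characterization data (arbitrary units).
LOADS = [10.0, 20.0, 40.0]
SLEWS = [50.0, 100.0]
TABLE = [
    [30.0, 40.0],
    [50.0, 60.0],
    [90.0, 100.0],
]

print(lut_delay(LOADS, SLEWS, TABLE, 15.0, 75.0))
```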

In the third method that can be considered, a design-oriented modeling of the
switching delays and transition times is developed from a careful study of the
switching process and currents of the logic structures. This method constitutes a new
alternative for developers of design and timing tools because it can be accurate and fast
and gives opportunities for performance optimization. An accurate model at inverter
level is usually given and then generalized to gates by reduction of the serial array of
transistors to an equivalent one [3], [4], [5], [6], [7], [8].

The goal of this paper is to extend a former global modeling of delay [9] to the
general characterization of the delay performance of deep submicron processes, in which
preceding approximations are removed to take into account the successive reduction of
transistor channel length. We propose a second generation of delay model for library
cells, validated on a 0.18 µm CMOS process.

In section 2 we present this new model developed for inverters. The extension to
gates is given in section 3. In each case validations are given through SPICE
simulations of different configurations of inverters and gates implemented in a
0.18 µm CMOS process. Finally, a conclusion on this work and its future extensions is
given in section 4.



2 Second Generation Delay Model for Inverters 

It is generally observed that the high electric fields related to short channel effects and
carrier speed desaturation effects during the switching process induce a non-linear variation
of the delay with the external loading and controlling parameters. As a result, the delay
of CMOS structures depends not only on the structure but also on the size of the gates, the
load, the controlling input slope and the rank of the switching gate input [10]. In Fig. 1
we represent the variation of the switching delay of inverters (gates) with different
configuration ratio values versus the input ramp duration $\tau_{IN}/t_{HLS}$, normalized with
respect to the structure step response used as a metric for performance, which we will
define later.

As shown, depending on the strength of the switching transistor, the delay
variation appears highly non-linear with the input ramp duration. This slow input
ramp effect is difficult to take into account using look-up tables and is responsible for
large discrepancies between prediction and measurements.

Hence the need for a timing performance model that includes an
accurate representation of switching delays and signal transition times, with clear
evidence of the structure, the supply voltage value, the size of the transistors, the
output load and the input signal transition time.

Fig. 1. Illustration of the non-linear variation of the switching delay of an inverter for
different control and load conditions.

We develop here the model for inverters focusing mainly on the falling edge, the
rising one being deduced easily by inspection.

2.1 Metric Definition 

As previously mentioned for Fig. 1, we define metrics for delay in order to obtain easy
calibration and design space representation. The first one is characteristic of the
speed performance of the process and is independent of the transistor width [9]. The
second one represents the step response of an inverter with a real load



where $C_L$ is the total output loading capacitance, $C_N$, $C_P$ represent the gate input
capacitance of the switching transistor and $R$ is the dissymmetry factor between N and
P transistors.

These parameters are used in a process characterization phase in which the
transistor threshold voltage, conduction factor and I/O coupling parameters are
calibrated following a well-defined protocol.

2.2 Second Generation Model 

As illustrated in Fig. 2, using the inverter step response as a reference and considering 
linear variation of the output wave form, the switching time corresponding to an 
output falling edge is defined by: 



$$ t_{HLS} = \tau_{ST}\,\frac{C_L}{C_N} \qquad (1) $$




TW 



ThL =tSP A-tHLS 



( 2 ) 



162 M. Rezzoug, P. Maurine, and D. Auvergne 



where $t_{SP}$ is the time of occurrence of the maximum P transistor short circuit current
[11], $\tau_{IN}$ the input ramp duration time and $t^*_{HLS}$ the cell step response corrected
for slow ramp effects.




Fig. 2. Inverter delay definition for an output falling edge. 



Considering that the P transistor short circuit current reaches its maximum value in
linear operating mode at the edge of saturation, the time $t_{SP}$ can be defined easily
from the derivative of the current expression with respect to time. This gives:



tsp — 



^IN 
( 3 )



where $v_{TP}$ is the threshold voltage value defined with respect to the supply voltage
and $t_{OV}$ the overshoot duration time [12] defined by:



tnv=%. 






( 4 ) 



with: 




Cm ■'^st ) 



where $C_M$ represents the I/O coupling capacitance.



( 5 ) 



In the same way using the preceding metric slow ramp effects on delay can be 
reproduced from [9]: 



* f Vmp\ ( 6 ) 

Thls =Thls .\l-2. \ 

where $V_{DSP}$ represents the drain source voltage value of the P short circuiting
transistor. Developing the three terms of eq. 2 results in a complete design oriented
delay modeling.
As shown, this model exhibits an explicit delay dependency on the technological
parameters ($\tau_{ST}$), the cell physical parameters ($C_N$, $C_P$, $C_M$) and the load and
control conditions ($C_L$, $\tau_{IN}$).

Despite its accuracy, the development of eq. 2 is still too complicated to be used for
performance evaluation and optimization. We propose here a simplified expression in
which, while conserving the design oriented delay representation, we introduce
pseudo-empirical non-linear correcting terms such as:



ThL = §n ■''TN 

I Cm tst (l- vtn) 

CN 



l-(ank^ + Pnk + en][;^ 

Vhls 



\yn 



+ tHLS 

where the output slope is still obtained from [9] as: 

'Tout = 2tHLS,LHS / 



(1 



nc , *HL,LH 

0.5 - Vxn,TP + 2 



(7) 



( 8 ) 



The three terms of eq. 7 represent respectively the input slope effect, the I/O
coupling responsible for the overshoot and the loading effect through the step response
$t_{HLS}$, with the correcting factor for the non-linear effects induced by slow input ramps.

These expressions have been validated on different processes (0.35, 0.25 and
0.18 µm), resulting in an accurate modeling of the switching delays and output
transition times over a large range of the design space (less than 10% discrepancy with
respect to HSPICE simulations performed using the foundry model and simulation
level).

This comparison between calculated and simulated delay and
transition time values for inverters with different configuration ratios is illustrated in
Figs. 3 and 4.



3 Extension to Gates 

We present in this section an extension of the preceding model to simple Nand and Nor
gates. The idea is to treat the gates as "equivalent" inverters through the evaluation of
reduction factors reflecting the effect of the serial array of transistors on the current
possibilities of the gate.

Let us for example consider a two-input Nand gate with equal width W for the N
transistors of the serial array. For identical load and input controlling ramp, the
switching delays are different depending on the switching transistor in the array. Due
to their biasing or controlling conditions, the two transistors of the serial array have
different current possibilities.
Fig. 3. Comparison of simulated and calculated falling and rising delay values for inverters
with configuration ratio ranging from 0.25 to 3, implemented in a 0.18 µm CMOS process.




Fig. 4. Comparison of simulated and calculated falling (A) and rising (B) output transition time
values for inverters with configuration ratio ranging from 0.25 to 3, implemented in a 0.18 µm
CMOS process.



The gate control voltage of the transistor close to the output (the top transistor of 
the array) is lower than the output voltage of the controlling gate due to the ohmic 
voltage drop of the conducting bottom transistor (close to the ground). 

In the same way this bottom transistor suffers from a power supply reduction due 
to the threshold voltage reduction through the top transistor working as a transmission 
gate. 

In this condition each input must be considered separately in order to develop an 
equivalent inverter representation from the evaluation of the switching current 
possibilities of the gate. 

Let us consider that the gate controlling input is the top input. The current supplied
by the top transistor of width W is smaller than that of an inverter of the same size.
This is due to the reduction of the applied gate-source voltage caused by the ohmic drop
occurring in the bottom transistor. This current reduction can be easily evaluated from
eq. 1 and found to be equivalent to a reduced width transistor.

In this way the inverter equivalent to the Nand gate controlled on the top input, that is,
with the same current possibility, can be defined with an N transistor of width:



IT 



Eq 



IT. 



Red 



Sat 



(9) 



where $Red_{Sat}$ is the current reduction factor previously discussed.
Using the delay model developed for inverters, this equivalent inverter will exhibit
the same delay performance as the Nand gate controlled on the top input.

Let us consider the situation where the bottom transistor is switching. Due to the
threshold voltage level degradation introduced by the top transistor, the equivalent
inverter can be easily deduced by replacing the serial array by a transistor of width W
but supplied through a reduced voltage equal to $V_{DD} - V_{TN}$. Note here that the top
transistor and the output capacitance, as explained in [9], will load the equivalent
inverter.

In fact, as observed, the working mode of this bottom transistor varies depending
on the value of the input slew. For fast input ramps the intermediate node is
discharged faster than the output one; in this case the current in the array is limited by
the top transistor. For slow input ramps the current is controlled by the bottom
one. This is illustrated in Table 1, where we can verify that for fast input ramps the
step responses are identical for bottom or top controls. For slow input ramp
conditions the reduction in supply voltage results in faster speed desaturation effects
of the carriers in the bottom transistor, resulting in a smaller step response than that of
the top one.

Table 1. Comparison of top, middle and bottom step response values of a 3 input Nand. 



3 inputs Nand Falling Step Responses 
The last case to consider is a transition controlled by the middle input. For fast
input ramps, as shown in Table 1, the top transistor fixes the middle step response too.
For slow input ramps, the transition is treated in two steps as a combination of a top
and a bottom commutation, as shown in Fig. 5.




Fig. 5. Two-step middle input transition modeling.
These reduction protocols for the delay and the output transition
time have been validated by comparison with SPICE simulations for 2 to 4
input Nand and Nor gates. As illustrated in Figs. 6 and 7, good agreement has been
observed between simulated and calculated values over a large design range (less than
10% discrepancy for the delay and the slopes).




Fig. 6. Comparison of simulated and calculated falling delay values for a 3-input Nand gate
for Top, Middle and Bottom input, implemented in a 0.18 µm CMOS process.




Fig. 7. Comparison of simulated and calculated rising delay values for a 2-input Nor gate for
Top and Bottom input, implemented in a 0.18 µm CMOS process.



4 Conclusion 

A severe challenge in deep submicron design is to accurately predict the timing
performance of designs at all levels of synthesis. To that end we presented a second
generation of delay performance modeling of CMOS structures considering the non-linear
dependency of delay on controlling parameters. Based on a metric defined to
characterize the process performance and the output transition time, we defined a
complete and design oriented model of delay for inverters. This model has then been
extended to gates using a reduction protocol considering the rank of the gate
switching input. Validation has been obtained by comparing the calculated values of
delay and output transition time of inverters and gates to values deduced from SPICE
simulations using the foundry card and simulation model defined for a 0.18 µm
CMOS process.

Extension to complex gates and to the management of timing closure during place
and route is under development.
Abstract. A structural discipline for constructing speed-independent
(hazard-free) circuits based on canonical chains of set-dominant and
reset-dominant latches is proposed. The method can be applied to decompose
complex asymmetric C-gates generated by logic synthesis from
Signal Transition Graphs, and to map them into a restricted gate array
ASIC library, such as IBM SA-12E, which consists of logic gates with
a maximum of four inputs and includes AO12, AOI12, OA12 and OAI12.
The method is illustrated by new implementations of practically useful
asynchronous circuits: a toggle element and an edge-triggered latch
controller.



1 Introduction 

Asynchronous circuits offer promising advantages for circuit design in deep-submicron
technology, amongst which the most attractive are low power, EMC,
modularity and operational robustness. As systems-on-chip become a reality, the
design of asynchronous control circuits that can tolerate variations in the timing
parameters of components is particularly important. Examples of such circuits are
interface controllers [1]. A class of asynchronous circuits that are insensitive to
gate delay variations is Muller's speed-independent (SI) circuits [2]. Extensive
research has been carried out on methods and algorithms for the synthesis of SI circuits in
the last decade [3]. A software tool, called Petrify [4], can synthesise an SI circuit
from its Signal Transition Graph (STG) specification [5] if the latter satisfies the
basic implementability conditions [3]. The result of synthesis is a circuit in which
each non-input signal is a generalised or asymmetric C-gate [6] (see Section 2).

The property of acknowledgement is characteristic of SI circuits compared
to their less conservative counterparts, such as Burst-Mode circuits [7] or Timed
circuits [8,9]. According to this property, every transition of each gate output is
acknowledged by another signal, which allows the circuit to operate correctly for
unbounded gate delays. Guaranteeing this property, however, is a difficult task,
particularly if the circuit realisation is restricted by a given gate library. Petrify
can perform logic decomposition using a gate and latch library, in which components
can be restricted to a given number of input literals. In order to preserve the
SI property after logic decomposition, Petrify seeks for the newly emerging gate
outputs to be acknowledged by other signals, or in the case of complements of
signals assumes the delay of input inverters ("bubbles") to be equal to zero. This
is a limitation that is not present in our canonical decomposition of generalised
C-gates.

* On leave from: Institute for Analytical Instrumentation, Russian Academy of Science,
St. Petersburg, Russia; work in Newcastle supported by EPSRC GR/M94359.

This paper addresses the problem of the gate level realisation of SI circuits 
for CMOS ASIC libraries in which cells may have a limited number of inputs, e.g. 
three. A regular method for constructing a class of speed-independent circuits 
composed of the 3-input gates AO12, AOI12, OA12 and OAI12 is presented. These 
gates implement the monotonic Boolean functions $d = a + bc$, $d = \overline{a + bc}$, 
$d = a(b + c)$ and $d = \overline{a(b + c)}$, respectively. Most CMOS gate libraries 
include these elements. For example, the IBM SA-12E gate array library [10] offers 
such gates with high speed and low power consumption, and compared to other 3-input 
(simple) gates, such as AND3 and OR3, their functional capabilities are greater: one 
can construct latches out of them. For instance, a simple state-holding element 
$d = a + bd$ (set-dominant latch) can be built out of just one AO12 if its output d is 
connected to its input c. We present examples of the application of our construction 
method by showing two new implementations of practically useful circuits: a toggle 
element and a pipeline stage (latch) controller. Both circuits are built 
as chains of the above-mentioned positive and negative gates. They are totally 
speed-independent, they do not have zero-delay inverters, and thus compare 
favourably to the existing solutions. 
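
The gate functions listed above can be made concrete with a small behavioural sketch. The Python function names below are ours and purely illustrative; the model ignores delays and only shows the Boolean functions, the monotonicity of the positive gates, and how feeding output d back to input c of an AO12 yields a set-dominant latch.

```python
from itertools import product

# Boolean functions of the four 3-input complex gates (all values are 0 or 1).
def ao12(a, b, c):    # d = a + b*c
    return a | (b & c)

def aoi12(a, b, c):   # d = NOT(a + b*c)
    return 1 - ao12(a, b, c)

def oa12(a, b, c):    # d = a*(b + c)
    return a & (b | c)

def oai12(a, b, c):   # d = NOT(a*(b + c))
    return 1 - oa12(a, b, c)

def is_increasing(gate):
    """True if raising any single input from 0 to 1 never lowers the output."""
    for v in product((0, 1), repeat=3):
        for i in range(3):
            if v[i] == 0:
                w = v[:i] + (1,) + v[i + 1:]
                if gate(*w) < gate(*v):
                    return False
    return True

def set_dominant_latch(a, b, d):
    """Next value of d for an AO12 whose output d is fed back to input c."""
    return ao12(a, b, d)

# Walk the latch through set, hold, reset and stay-reset input combinations.
d = 0
trace = []
for a, b in [(1, 0), (0, 1), (0, 0), (0, 1)]:
    d = set_dominant_latch(a, b, d)
    trace.append(d)
print(trace)  # [1, 1, 0, 0]
```

The trace shows that a = 1 sets the latch, b = 1 merely holds the stored value, and the output falls only when a = b = 0.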

Negative gate circuits attracted attention about three decades ago since it 
was noticed that basic CMOS gates have inherent output inverters and thus 
implement decreasing, or negative monotonic, Boolean functions [11,12,13]. Later, 
interest in negative asynchronous circuits arose when it became clear that they 
consume less power than their non-negative counterparts [14,15,16]. 

The rest of the paper is organised as follows. Basic latches are introduced in 
Section 2. Positive latch chains for the implementation of asymmetric C-gates 
are described in Section 3. Negative chains and circuit reduction methods are 
presented in Section 4. Section 5 illustrates applications, a toggle element and 
edge-triggered latch control circuits. Analysis of behavioural correctness of our 
circuits is discussed in Section 6. Section 7 contains the conclusion. 

2 Basic Latches and Notations 

A latch built of a single AO12 element, known as a set-dominant latch, is shown 
in Fig. 1(a). Its behaviour is described by the STG depicted in Fig. 1(b), where 
"+" denotes the rising signal edge and "-" the falling edge. Signals a and b are 
inputs, signal d is an output. The solid arcs depict causality relations within the 
circuit, whereas the dotted arcs describe the environment behaviour. 

Following the STG in Fig. 1(b), transition a+ causes transition d+ while 
transition d- is caused by the firing of a- and b-. The signalling discipline 
between the latch and the environment assumes that transitions at a and b inputs 
may only occur after signal d becomes stable. This latch can be considered as a 
simple case of an asymmetric C-gate [6] as shown in Fig. 1(c). Input a in this 
drawing, being connected to the main body of the symbol, controls both d+ and 
d-, while input b, being connected to the extension marked "-", controls only 
d-. 

Fig. 1. Basic latches: b-gate (a), its behaviour (b) and notation (c); p-gate (d), its 
behaviour (e) and notation (f); negative b̄- and p̄-gates and their notations (g, h) 

A dual circuit (reset-dominant latch) can be built from an OA12 gate. Its schematic, 
STG and symbolic notation are shown in Fig. 1(d,e,f). In this case, input a controls 
both edges of d while p controls only the rising edge of d. The shapes of 
symbols in Fig. 1(c,f) look similar to the Latin characters "b" and "p", which we 
will use to denote the asymmetric C-gates as b-gate and p-gate respectively. The 
latches with inverted outputs, shown in Fig. 1(g,h), will be denoted as b̄-gate and 
p̄-gate respectively. In the following sections the latches of b, p, b̄ and p̄ types are 
used as building blocks to construct more complex components of SI circuits. 

3 Generalised Asymmetric C-gates 

3.1 Homogeneous Positive Latch Chains: Generalised Latches 

A homogeneous chain comprising b-gates only is shown in Fig. 2(a). Such a 
circuit, denoted as b^n, where n is the number of stages, implements a generalised 
C-gate with a single input a controlling both edges of signal d and n signals 
b1,...,bn controlling d- only (set-dominant latch). A dual circuit, denoted as 
p^m, where m is the number of stages, is shown in Fig. 2(b). Note that b^n and 
p^m chains are transitive, so any pair of gates within a chain can be swapped 
without affecting the external specification of the chain. 

Similar chains of more complex latches can be constructed. An example of a 
three-input p-gate (a, p1, p2) is shown in Fig. 2(c). Its transistor-level 
implementation could be simple, being just one transistor pair larger than a b-gate. Such 
an element is not present in most gate array libraries and, therefore, will not be 
considered. However, it can be implemented as a p^2-chain. 

3.2 Heterogeneous Positive Latch Chains: C-gates 

Any asymmetric C-gate can be constructed as a composition of two generalised 
b and p-latches (see Fig. 3(a)), which results in a heterogeneous latch chain. Two 
examples of a 3-input asymmetric C-gate, based on two simple b and p-latches, 
are shown in Fig. 3(b). 

Fig. 2. Homogeneous b^n-chain (a); p^m-chain (b); single gate realisation of p^2-chain (c) 




Fig. 3. Heterogeneous latch chains: b^n p^m-chain for generalised asymmetric C-gate 
implementation (a); pb and bp-chains (b); 2-input symmetrical bp-chain C-element (c); 
Mayevsky C-element (d); 3-input (e) and 4-input (f) C-element 



Both chains in Fig. 3(b) are equivalent in their functionality, though having 
different signal delays from input b1 (or p1) to output d. The chain function is 
preserved under any transposition of b and p-gates. Hence, any heterogeneous 
chain consisting of n b-gates and m p-gates is functionally equivalent to the 
b^n p^m-chain. Both bp and pb-chains can be used to implement a two-input symmetric 
(Muller) C-element [18] as shown in Fig. 3(c). This realisation compares favourably 
to the known Mayevsky [19] C-element shown in Fig. 3(d). The following 
list contains pairs of parameters for comparison of the pb-chain against the Mayevsky 
C-element in the CMOS AMS-3.11 0.6µ realisation: 2/5 gates, 16/25 transistors, 
1.51/1.85 ns cycle time, 11.9/21.5 pJ energy per cycle and 699/1748 area. 
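
The claim that a two-gate chain behaves as a Muller C-element can be checked with a short behavioural sketch. The wiring used here is our assumption, since the figure is not reproduced: input x drives input a of the b-gate, input y drives both extension inputs, and the b-gate output s drives input a of the p-gate. Gate delays are abstracted away and inputs change one at a time.

```python
def b_gate(a, b, d):
    """Set-dominant latch: d rises with a and falls only when a = b = 0."""
    return a | (b & d)

def p_gate(a, p, d):
    """Reset-dominant latch: d falls with a and rises only when a = p = 1."""
    return a & (p | d)

def c_element(x, y, c):
    """Reference Muller C-element: c = xy + (x + y)c."""
    return (x & y) | ((x | y) & c)

def bp_chain(x, y, s, d):
    """Settle the assumed two-gate chain; s is the internal b-gate output."""
    while True:
        s_next = b_gate(x, y, s)
        d_next = p_gate(s_next, y, d)
        if (s_next, d_next) == (s, d):
            return s, d
        s, d = s_next, d_next

def chain_matches_c_element():
    """Explore every state reachable by single-input changes from reset."""
    start = (0, 0, 0, 0, 0)          # x, y, s, d, reference output c
    seen, todo = {start}, [start]
    while todo:
        x, y, s, d, c = todo.pop()
        for nx, ny in ((1 - x, y), (x, 1 - y)):
            ns, nd = bp_chain(nx, ny, s, d)
            nc = c_element(nx, ny, c)
            if nd != nc:
                return False
            state = (nx, ny, ns, nd, nc)
            if state not in seen:
                seen.add(state)
                todo.append(state)
    return True

print(chain_matches_c_element())
```

Under these assumptions the chain output agrees with the reference C-element in every reachable state. The sketch says nothing about timing, area or energy, which are the quantities compared in the list above.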

3.3 Chain Length Reduction 

Serial connection of elements in a b^n p^m-chain may cause a significant delay. In 
many cases the chain length can be reduced by using a simpler generalised C-gate 
(with fewer inputs) and an expansion circuit comprising AND/OR gates. 

A traditional expansion solution [17], shown in Fig. 3(f), uses an OR gate 
to detect the all-zeroes input state (the condition of switching to zero) and an 
AND gate to detect the all-ones input state (the condition of switching to one). 
The outputs of these gates are connected to the inputs of the symmetrical C- 
gate. It is easy to check that all signals in this circuit are acknowledged under 
the assumption of wires having no delay (Muller’s SI model). This method is 
applicable only to the symmetrical C-gate inputs, i.e. to those which control 
both events of output switching to 1 and to 0. 

A new compact solution to the expansion problem is shown in Fig. 3(e). 
It uses a single b-latch instead of the symmetric C-gate. This improvement is 
achieved at the expense of an additional circuit (connecting the output of the 
OR-gate to an additional input of the AND gate) providing the acknowledgement 
of 1 at the output of the OR-gate. A disadvantage of this solution is that the number 
of possible inputs is reduced by 1 in comparison with Fig. 3(f). 
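
Both expansion schemes can be compared at the functional level with the following sketch. It is a delay-free model with helper names of our own choosing, so it illustrates only the logic function and not the acknowledgement property, which depends on gate delays.

```python
from itertools import product

def c_element(x, y, c):
    """Two-input symmetric C-element."""
    return (x & y) | ((x | y) & c)

def b_latch(a, b, d):
    """Set-dominant latch d = a + b*d."""
    return a | (b & d)

def traditional_expansion(inputs, c):
    """AND and OR gates feeding a symmetric C-element."""
    return c_element(int(all(inputs)), int(any(inputs)), c)

def compact_expansion(inputs, d):
    """OR output also feeds the AND gate; a single b-latch holds the state."""
    or_out = int(any(inputs))
    and_out = int(all(inputs) and or_out)
    return b_latch(and_out, or_out, d)

def n_input_c_element(inputs, c):
    """Reference: rise when all inputs are 1, fall when all are 0, else hold."""
    if all(inputs):
        return 1
    if not any(inputs):
        return 0
    return c

same = all(
    traditional_expansion(v, c) == compact_expansion(v, c) == n_input_c_element(v, c)
    for v in product((0, 1), repeat=3)
    for c in (0, 1)
)
print(same)
```

In this model the extra OR-to-AND connection does not change the function; its purpose in the circuit is to acknowledge the OR-gate output, at the cost of one AND input.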



4 Negative Latch Chains 

4.1 General Properties 

Note that connecting a p or b-gate to the symmetrical input a of a b^n p^m-chain 
implementing a generalised C-gate is equivalent to adding an input to the "+" or 
"-" extension of the C-gate, respectively, as shown in Fig. 4(a) for a b-gate. The 
rule of adding inputs to the extensions for negative latch chains is more complicated. 










Fig. 4. Negative latch chains transformations: connecting b-gate to C-gate input (a); 
duality (De Morgan's) rule for b̄- and p̄-gates (b); connecting b̄-gate to C-gate (c), 
conversion of b̄p̄-chain to an asymmetric C-gate (d), b̄p̄^2b̄-chain (e), b̄^2-chain (f); 
transparent latch implementation (g), example of complex negative chain (h) 

The functionality of a p̄-gate is equivalent to that of a b-gate with all inputs 
inverted. Output d of a p̄-gate (see Fig. 4(b)) gets 1 as soon as a = 0 and it gets 
0 as soon as a = p1 = 1. The same holds for a b-gate with inverted inputs: if ā = 1, 
then output d gets 1 and if ā = p̄1 = 0, then output d gets 0. This corresponds 
to De Morgan's laws. 
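
The duality can be verified exhaustively. The sketch below assumes that a negative gate is the corresponding positive latch followed by an output inverter, with the state feedback taken before the inverter; the function names are ours.

```python
from itertools import product

def b_gate(a, b, d):
    """Positive set-dominant latch: d = a + b*d."""
    return a | (b & d)

def p_gate(a, p, d):
    """Positive reset-dominant latch: d = a*(p + d)."""
    return a & (p | d)

def neg_p_gate(a, p, d):
    """p-gate followed by an output inverter; d is the inverted output."""
    q = 1 - d                      # internal, non-inverted latch state
    return 1 - p_gate(a, p, q)

def neg_b_gate(a, b, d):
    """b-gate followed by an output inverter; d is the inverted output."""
    q = 1 - d
    return 1 - b_gate(a, b, q)

# A negative gate equals the dual positive gate with all inputs inverted.
dual_ok = all(
    neg_p_gate(a, x, d) == b_gate(1 - a, 1 - x, d)
    and neg_b_gate(a, x, d) == p_gate(1 - a, 1 - x, d)
    for a, x, d in product((0, 1), repeat=3)
)
print(dual_ok)
```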

Connecting a p̄ or b̄-gate to the input of a generalised C-gate is equivalent to 
adding an inverted input to the "-" or "+" extension of the C-gate, respectively, 
and inverting its input a. The example in Fig. 4(c) illustrates this for the b̄-gate. 

Let us consider chains consisting only of p̄- and b̄-gates, starting with the 
simplest case of a heterogeneous negative b̄p̄-chain shown in Fig. 4(d). Using the 
above transformations one can see that such a chain results in an asymmetric 
C-gate with two inputs connected to the same extension. The symmetric chain b̄p̄^2b̄ 
leads to a more symmetric generalised C-gate depicted in Fig. 4(e). In general, each 
p̄ and b̄-gate contributes to either the "+" or "-" extension, respectively, without 
input inverters if the signal path leading from the input of this gate to the chain 
output includes an odd number of inverters. If such a path includes an even number 
of inverters (bubbles), then the p̄-gate (b̄-gate) contributes an inverted input to the 
"-" ("+") extension. 

The immediate consequence of this claim is that any transposition of odd 
gates in a negative chain preserves its functionality. The same takes place for even 
gates. Hence, each negative chain is functionally equivalent to 
$\bar{p}^{n}(\bar{p}\bar{b})^{m}(\bar{b}\bar{p})^{k}\bar{b}^{l}$, 
where m, k = 0, 1, 2, ...; n, l = 0, 2, 4, ... 

A special case of such a chain, namely the b̄^2-chain (see Fig. 4(f)), is useful for 
transparent latch realisations. The transparent latch with "enable" input t, data 
input a and output d, which is transparent when t = 0 and opaque for t = 1, 
can be implemented as a generalised C-gate shown in Fig. 4(g) with t connected 
to both extensions. 
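
A behavioural sketch of a generalised C-gate makes the transparent latch construction easy to check. The function below is our own abstraction: the symmetric input a controls both edges, every "+" extension input must be 1 for d to rise, and every "-" extension input must be 0 for d to fall. We assume that t enters the "+" extension inverted and the "-" extension directly, which is the polarity that gives transparency for t = 0.

```python
from itertools import product

def generalised_c_gate(a, plus, minus, d):
    """Next output of a generalised C-gate (delay-free model).

    a     -- symmetric input, controls both edges of d
    plus  -- "+" extension inputs, all must be 1 for d to rise
    minus -- "-" extension inputs, all must be 0 for d to fall
    """
    if a == 1 and all(v == 1 for v in plus):
        return 1
    if a == 0 and all(v == 0 for v in minus):
        return 0
    return d

def transparent_latch(a, t, d):
    """t on both extensions: inverted on "+", direct on "-" (assumed polarity)."""
    return generalised_c_gate(a, plus=[1 - t], minus=[t], d=d)

for a, t, d in product((0, 1), repeat=3):
    expected = a if t == 0 else d    # transparent for t = 0, opaque for t = 1
    assert transparent_latch(a, t, d) == expected
```

With empty extension lists the same function reduces to a plain buffer, and with a single "-" input it reproduces the b-gate of Section 2.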

Finally, as a more complex example, a heterogeneous (p̄b̄)^n(b̄p̄)^n-chain for a 
generalised C-gate with n inputs in both the "+" and "-" extensions, with half of 
them being inverted, is given in Fig. 4(h). 



4.2 Reduction of Negative Chains 

Negative chains comprising b̄ and p̄-gates look more complex than those 
comprising b and p-gates. However, the structure of b̄ and p̄-gates, if implemented 
by CMOS circuits, includes two inverters: the first is the inherent inverter of the b 
or p-gate and the second is the output inverter depicted in Fig. 1(g,h). These 
inverters can be removed without changing the chain function in semi-modular 
applications. Such a circuit is simpler and faster than its positive counterpart. 

A new realisation of an inverting transparent latch can be derived from Fig. 4(g). 
The circuit in Fig. 5(a) is obtained by refining the two b̄-gates. Note that signal 
transitions propagate from left to right in that negative gate chain in such a way 
that, in any cycle which starts after signal d assumes a new value, signal g accepts 
the value of signal f with some delay. Therefore, the feedback from f to the input 
of the left-most b̄-gate can be replaced by the feedback from output g. Further, 
the transitions at f and e do not affect any other signal in the circuit. Hence, the 
inverters at f and e can be safely removed without affecting the circuit function. 
This approach can be applied to any negative chain, as shown in the examples in 
Fig. 5(b,c,d). A three-input C-element with two inverted inputs (see Fig. 5(b)) 
can be realised as a p̄b̄^2p̄-chain. A similar C-gate with an additional built-in 
transparent latch can be obtained from a reduced b̄^2p̄b̄^2p̄-chain as shown in Fig. 5(c). 
A mixed negative-positive b̄^2bp-chain, consisting of negative and positive latches, 
realises a two-input symmetric C-element with a built-in transparent latch (see 
Fig. 5(d)). This solution can be seen as a Muller C-element enhanced with the 
enabling/blocking input t. 

Fig. 5. Reduction of negative chains for: transparent latch application (a), C-gate 
with a few inputs inverted (b), the same with built-in transparent latch (c), 2-input 
C-element with built-in transparent latch (d) 



5 Examples Based on Reduced Negative Chains 

We have, so far, considered only applications mapped easily on latch chains. We 
will now consider other applications, which are compositions of two chains. These 
implementations present new solutions to the known practical circuit designs. 
They illustrate the power of the reduction approach applied to a chain that is a 
backbone of the application. 

Toggle. A toggle is one of the key elements in constructing self-timed 
micropipeline controllers [20] with a two-phase signalling discipline. The STG of 
a toggle element is shown in Fig. 6(a). It responds to each even (odd) transition 
on input x with a transition on output y1 (y2). 

A known solution for the toggle circuit [21], based on two transparent latches 
with different polarity of the control signal x, is shown in Fig. 6(b). We propose 
the implementation shown in Fig. 6(c), which is based on the transparent latch 
shown in Fig. 5(a). Its STG is given in Fig. 6(d). 

Fig. 6. A toggle circuit: STG (a), transparent latch-based realisation (b), reduced 
negative circuit (c), refined STG (d) 

Implemented in the IBM SA-12E gate array library (0.25µ, 2.5V), this 
circuit has the following delays from input x to outputs y1 and y2: d(y1+) = 
0.29ns, d(y2+) = 0.19ns, d(y1-) = 0.30ns, d(y2-) = 0.24ns. 
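
The external behaviour of a toggle can be illustrated with a delay-free model of the known latch-based solution. The loop structure used here (two transparent latches of opposite polarity with one inversion in the feedback path) and the all-zero reset state are our assumptions; which output answers the first transition depends on them.

```python
def toggle_step(x, y1, y2):
    """Settle two cross-coupled transparent latches for a given level of x."""
    if x == 1:
        y1 = 1 - y2      # first latch is transparent while x = 1
    else:
        y2 = y1          # second latch is transparent while x = 0
    return y1, y2

def run_toggle(n_transitions):
    """Apply n transitions to x and record which output switches each time."""
    x, y1, y2 = 0, 0, 0
    events = []
    for _ in range(n_transitions):
        x = 1 - x
        new_y1, new_y2 = toggle_step(x, y1, y2)
        if new_y1 != y1:
            events.append("y1")
        if new_y2 != y2:
            events.append("y2")
        y1, y2 = new_y1, new_y2
    return events

print(run_toggle(6))  # ['y1', 'y2', 'y1', 'y2', 'y1', 'y2']
```

Each transition on x produces exactly one output transition, and the two outputs strictly alternate, which is the two-phase behaviour described above.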

Edge-triggered latch control circuit. The edge-triggered latch control circuit 
described in [6] has the STG shown in Fig. 7(a). We refine the implementation 
based upon asymmetric C-gates [6] using our basic negative chains. The circuit 
in Fig. 7(b) is obtained by further reduction and simplification. 

Fig. 7. Edge-triggered latch control: STG (a), circuit (b) 



6 Behavioural Correctness 

Semi-modularity [2] of the two examples above was checked with the Versify tool. 
All our other proposed solutions are also semi-modular, i.e. no hazards are possible 
under the correct environment behaviour defined in Fig. 1(b,e). The intuitive 
reasoning behind this claim (the proof is omitted) is that under such a discipline 
every transition on the output is followed by a single transition of each input, 
which in turn is eventually acknowledged by the next output transition. 

All the properties described above have been obtained under the assumption of 
monotonic environment behaviour. That is, if the circuit input is set to some 
particular value, which is the necessary condition of the output event, then this 
value must not change until the output event happens. All our circuits are robust 
to monotonic environment behaviour. However, there are applications where the 
environment, being semi-modular (hence SI), is allowed to withdraw such an 
input, provided that the output is not excited. This environment behaviour is 
non-monotonic. 

Such non-monotonic inputs, when applied to a circuit comprising several 
stages, may cause switching of the internal signals. Under the above condition of 
the output not being excited, the events on internal signals are not acknowledged 
at the circuit output, which may result in hazards. 
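
A small delay-free sketch shows how an internal signal can switch without acknowledgement. It uses a two-gate bp-chain with a wiring that is our assumption (x on input a of the b-gate, y on both extension inputs, internal signal s between the gates) and counts the transitions of s and of the output d.

```python
def b_gate(a, b, d):
    """Set-dominant latch d = a + b*d."""
    return a | (b & d)

def p_gate(a, p, d):
    """Reset-dominant latch d = a*(p + d)."""
    return a & (p | d)

def count_transitions(input_sequence):
    """Count transitions of internal signal s and output d of a bp-chain."""
    s = d = 0
    s_changes = d_changes = 0
    for x, y in input_sequence:
        s_next = b_gate(x, y, s)
        d_next = p_gate(s_next, y, d)
        s_changes += int(s_next != s)
        d_changes += int(d_next != d)
        s, d = s_next, d_next
    return s_changes, d_changes

# x is raised and then withdrawn while y stays 0, so d is never excited.
print(count_transitions([(1, 0), (0, 0)]))  # (2, 0)
```

The internal signal rises and falls while the output stays at 0, so neither internal transition is acknowledged. With real gate delays such a pulse may be cut short, which is the hazard scenario referred to above.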

Latches and chains shown in Fig. 1-5 may produce hazards in a non-monotonic 
environment. The robustness analysis of the proposed circuits in non-monotonic 
environments is a subject of future work. 

7 Conclusion 

A method for designing speed-independent asynchronous controllers using a limited 
fan-in gate library has been developed. It is based on chains of set-dominant 
and reset-dominant latches. Several regular structures comprising positive and 
negative chains are studied and a reduction technique is used at the latch level. 
Our method can be applied to decompose complex asymmetric gate implementations 
generated by logic synthesis tools (such as Petrify) from Signal Transition 
Graphs, and to perform mapping into a restricted ASIC gate array library, such 
as IBM SA-12E (which contains logic gates with a maximum of three or four inputs 
and includes the AO12, AOI12, OA12 and OAI12 logic gates). No assumptions on inverter 
(bubble) delay are used. The method has been illustrated by new implementations 
of practically useful asynchronous building blocks: a toggle element and 
an edge-triggered latch controller. 

Abstract. Improving processor performance pushes designers to look for every 
possible design alternative. Moreover, the need for embedded processing cores 
exhibiting low power consumption and reduced EM noise is leading to changes 
in system design. This trend has suggested the adoption of self-timed systems, 
whose energy and noise characteristics depend upon the processed data and the 
processing rate. In this paper, we explore the design space of first-in first-out 
queues, which are a fundamental component in most recent proposals for 
asynchronous processors. Different strategies have been refined and evaluated 
using the handshake circuits methodology. 



1 Introduction 

The adoption of sub-micron technologies is raising questions about the processor 
design methodology to be adopted [1]. Wires stretching from one side of the chip to 
another are starting to behave like transmission lines, with delays (quantized in clock 
cycles) varying with both layout and process technology [2]. Processor architectures 
and implementations offering critical paths dominated by gates instead of 
interconnections will thus be preferred to current implementations, which focus on 
reducing gate delay on the critical path. Within this scenario, the asynchronous circuit 
design discipline may prove helpful, thanks to its reliance on resource and 
communication locality [3]. 

In order to exploit asynchronous system design, recent proposals have been 
evaluated which aim at the development of an asynchronous-friendly architectural 
template [4-6]. Such an asynchronous-friendly architecture should exploit features like 
decentralized control, de-coupling and data-dependent computation. The ideal 
situation would be to have an architectural template with all resources divided among 
fully de-coupled clusters, which only interact for data communication. 
Communication is generally implemented by means of first-in first-out queues 
(FIFOs) in order to improve elasticity and de-coupling among resource clusters. Such 
FIFOs can be either transparent to the compiler (i.e. dynamically allocated) or treated 
as simple registers. In the latter case, they also prove valuable in optimizing streaming 
computation such as in DSP applications [4,5]. 

Therefore, the choice of FIFO architecture is an important design parameter, which 
may affect processor performance under its normal working load. In this paper, we 
explore the design space of asynchronous FIFO queues. Different solutions, among 
which two interesting schemes, are analyzed and compared on the basis of their 
latency, throughput, area and power consumption. In this way, a complete picture of 
asynchronous FIFO design can be drawn and used in selecting which approach is most 
suitable for a given design. 

This paper is organized as follows. The design methodology is described in Section 
2. The FIFO architectures under analysis are described in Sections 3, 4 and 5. 
Two novel FIFO architectures, which refine a more conventional one, are also 
discussed in Section 5. Results and comparison are given in Section 6, whilst some 
conclusions are drawn in Section 7. 



2 Introducing Handshake Circuits 

The methodology adopted for the designs described in this paper is based on 
handshake circuits [7]. A handshake circuit is a connected graph that consists of 
so-called handshake components, which communicate and synchronize with each other 
through handshake channels. 

In general, a handshake circuit may feature control components, which only have 
communication channels not carrying information, and data components, which carry 
information. In between there are the so-called interface components; these can 
perform handshaking with or without carrying information at the same time. 

The behavior of a handshake channel in each component can be classified into 
active or passive channel ports. An active channel port raises the request signal and 
waits for acknowledgment, whilst a passive one waits for the request and raises the 
acknowledgment. Components can feature a different mix of channels leading to 
complex behavior such as pull components (passive input and active output ports), 
push components (active input and passive output ports), passive components (passive 
input and output ports), and so on. This allows a minimal set of components featuring 
the required basic operations to be defined, upon which more complex asynchronous 
systems can be built. 

Such basic components can thus be implemented using different asynchronous 
design styles or even a synchronous approach as recently discussed [7]. In this paper, 
we have considered only single-rail four-phase handshake circuits, because they 
previously proved a more efficient choice in terms of power budget and performance 
[8]. 
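
The order of events on a single-rail four-phase channel can be sketched as follows. This is an illustrative software trace, not a circuit description: the wire names are ours, and we assume that data travels from the active to the passive side and is set up before the request is raised.

```python
def four_phase_transfer(value):
    """Return the received value and the event trace of one transfer."""
    trace = []
    wires = {"req": 0, "ack": 0, "data": None}

    wires["data"] = value          # data is set up before the request
    wires["req"] = 1               # active port raises the request
    trace.append("req+")
    received = wires["data"]       # passive port samples the data
    wires["ack"] = 1               # passive port raises the acknowledgment
    trace.append("ack+")
    wires["req"] = 0               # active port returns the request to zero
    trace.append("req-")
    wires["ack"] = 0               # passive port returns the acknowledgment to zero
    trace.append("ack-")
    return received, trace

print(four_phase_transfer(42))  # (42, ['req+', 'ack+', 'req-', 'ack-'])
```

Every transfer costs four signal transitions, and both wires are back at zero before the next transfer can start.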



3 Standard FIFO Architecture 

A standard asynchronous first-in first-out queue is based on cascaded buffering 
stages, each of which behaves like a pull component (Figure 1). Each stage waits for a 
new input data item on its single passive input channel and then outputs it through 
its active output channel. The functionality of each stage is thus minimal and requires 
very low design complexity: it is equivalent to two logic gates and a register. 

Fig. 1. Standard ripple FIFO architecture 

Fig. 2. Circular FIFO architecture with mutual exclusion arbitration 

The standard (sometimes called ripple) FIFO queue presents the advantage of 
being extremely modular, since it can be expanded by simply cascading additional 
stages. Nevertheless, its main drawback is evident: pushed data has to go (ripple) 
through all stages in the FIFO queue before being popped. Therefore, minimum 
latency and power consumption are determined by the number of stages in the queue, 
whilst its throughput is determined by the number of full stages (i.e. tokens) and 
empty stages (i.e. bubbles). If the number of tokens is constantly equal to the number 
of bubbles, throughput is at its peak and latency is at its minimum. If the number of 
tokens is higher, both throughput and latency worsen. Otherwise, only throughput 
worsens. Power consumption per token is constant, while overall power budget 
depends on both throughput and number of stages. 
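
The dependence of throughput on the balance between tokens and bubbles can be illustrated with a toy model. The sketch below is our own abstraction, not the simulated design: the stages are closed into a ring, time advances in synchronous steps, and a token advances whenever the next stage held a bubble at the start of the step.

```python
def ring_moves_per_step(n_stages, n_tokens, warmup=None, measure=100):
    """Average number of token moves per step in a ring of FIFO stages."""
    if warmup is None:
        warmup = 10 * n_stages                 # let the initial transient die out
    full = [i < n_tokens for i in range(n_stages)]
    moves = 0
    for step in range(warmup + measure):
        # A token can move only if the following stage currently holds a bubble.
        can_move = [full[i] and not full[(i + 1) % n_stages]
                    for i in range(n_stages)]
        nxt = list(full)
        for i, go in enumerate(can_move):
            if go:
                nxt[i] = False
                nxt[(i + 1) % n_stages] = True
        full = nxt
        if step >= warmup:
            moves += sum(can_move)
    return moves / measure

for k in range(1, 8):
    print(k, ring_moves_per_step(8, k))
```

For an 8-stage ring the number of moves per step grows with the number of tokens up to 4 and then falls again, so the peak is reached when tokens and bubbles are balanced.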



4 Arbitrated Circular FIFO Architecture 

A different approach is to implement asynchronous FIFO queues using a 
memory-like scheme (Figure 2). Push and pop operations are executed 
through an input and an output port, which respectively store data in and read data 
from an internal memory implementing the queue slots. In this case, each port has 
knowledge of the last read and written slots (i.e. memory addresses) in the form of tail 
and head pointers. These pointers are updated depending on the current operation, 
generally leading to performance reduction. 







Fig. 3. De-coupled circular FIFO using buffering processes and centralized control. 

This approach may improve performance: in fact, each token passes only through the 
two ports and a single slot. Therefore, latency and power demand should improve whilst 
throughput should worsen. Unfortunately, such a picture is easily drawn only in 
synchronous systems, where pointers can be transparently updated once per clock 
cycle. In the case of an asynchronous design, both ports must be explicitly 
synchronized in order to ensure that tail and head pointers are consistent in the current 
operation for any possible sequence of operations. This implies that these pointers are 
a shared resource that should be accessed in mutual exclusion. Therefore, arbitration 
(through a semaphore as in Figure 2) is required in order to grant access to this 
shared resource to only one port at a time. If multiple concurrent requests were raised, 
arbitration could incur an additional speed penalty because of internal metastability 
[3]. 
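
A software analogy of the arbitrated scheme is a circular buffer whose pointers are protected by a single lock. The class below is only a sketch with names of our own choosing; `threading.Lock` stands in for the hardware semaphore, and metastability has no counterpart in this model.

```python
import threading

class ArbitratedCircularFifo:
    """Circular buffer whose head and tail pointers are guarded by one lock."""

    def __init__(self, depth):
        self.slots = [None] * depth
        self.head = 0                       # next slot to read
        self.tail = 0                       # next slot to write
        self.count = 0
        self.arbiter = threading.Lock()     # stands in for the semaphore

    def push(self, item):
        with self.arbiter:                  # input port owns the pointers
            if self.count == len(self.slots):
                return False                # queue full, caller must retry
            self.slots[self.tail] = item
            self.tail = (self.tail + 1) % len(self.slots)
            self.count += 1
            return True

    def pop(self):
        with self.arbiter:                  # output port owns the pointers
            if self.count == 0:
                return None                 # queue empty
            item = self.slots[self.head]
            self.head = (self.head + 1) % len(self.slots)
            self.count -= 1
            return item

fifo = ArbitratedCircularFifo(2)
fifo.push("a")
fifo.push("b")
print(fifo.pop(), fifo.pop())  # a b
```

Every push and pop has to win the lock first, which mirrors the serialisation of the two ports on the shared pointers.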



5 The De-coupled Circular FIFO Architecture 

Both the ripple and the arbitrated FIFO schemes each present a single distinctive 
feature, which is mostly responsible for their different performance. In the ripple 
FIFO, each slot behaves like a buffering process that does not accept new data before 
the last one has been transmitted. In the arbitrated scheme, the slot itself is just a 
memory location and the correct sequence of events is ensured by means of memory 
pointers (tail and head), which are handled by coupled I/O ports. The former feature 
leads to a scheme without arbitration, whilst the latter leads to lower latency and 
power budget thanks to the absence of rippling. 

Fig. 4. Circular FIFO using buffering processes and distributed input control (token lines). The 
marked slot is the first active one after reset. 

Fig. 5. Circular FIFO using buffering processes and fully distributed control (token in/out 
lines). The marked slot is the first active one after reset. 

An evident improvement on both schemes could be achieved when the process-like 
slot (ripple scheme) is combined with the notion of memory pointer (arbitrated 
scheme) as in Figure 3. In such a FIFO queue, both input and output ports are 
de-coupled, since they do not interact by means of a semaphore. The correct sequence of 
operation is ensured by the fact that each slot will automatically lock itself until its 
latest data has been transmitted. Such a scheme avoids the penalty of a semaphore 
by adopting a more complex FIFO slot, which should result in improved performance. 

This scheme can be further refined by moving the functionality of each port into 
the process-like slots. Each port could be replaced by a communication ring by which 
a slot is enabled as the next active slot for a given action. In this way, we may 
obtain a scheme with either only one (Figure 4) or no I/O ports (Figure 5). The 
advantage of such schemes is obvious: no pointer is required, thus improving latency 
and reducing power consumption. Moreover, the additional handshaking required to 
enable the following slot along the rings is easily hidden in the normal operation 
cycle. Nevertheless, they require broadcast of the input and output channels as well as 
an increase in slot complexity that could penalize the throughput. 
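
The fully distributed scheme can be sketched in software as a ring of self-locking slots that pass two tokens around. The class and attribute names below are hypothetical, and the model is sequential, so it shows only the ordering logic: the input and output channels are broadcast to all slots, and only the slot holding the corresponding token reacts.

```python
class Slot:
    """Buffering process that locks itself while it holds unread data."""

    def __init__(self, in_token=False, out_token=False):
        self.data = None
        self.full = False
        self.in_token = in_token      # enabled for the next push
        self.out_token = out_token    # enabled for the next pop

class TokenRingFifo:
    """Circular FIFO without pointers: tokens circulate among the slots."""

    def __init__(self, depth):
        # Slot 0 is the first active one after reset for both rings.
        self.slots = [Slot(i == 0, i == 0) for i in range(depth)]

    def push(self, item):
        n = len(self.slots)
        for i, slot in enumerate(self.slots):      # input channel is broadcast
            if slot.in_token:
                if slot.full:
                    return False                   # slot still locked
                slot.data, slot.full = item, True
                slot.in_token = False
                self.slots[(i + 1) % n].in_token = True   # enable next slot
                return True
        return False

    def pop(self):
        n = len(self.slots)
        for i, slot in enumerate(self.slots):      # output channel is broadcast
            if slot.out_token:
                if not slot.full:
                    return None                    # nothing to send yet
                item = slot.data
                slot.data, slot.full = None, False
                slot.out_token = False
                self.slots[(i + 1) % n].out_token = True
                return item
        return None
```

A push never touches the output token and a pop never touches the input token, so in this model the two sides interact only through the full flag of an individual slot.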



6 Analysis and Comparison 

We have evaluated the different schemes by implementing a ripple FIFO (SF), an 
arbitrated circular FIFO (AF), a de-coupled FIFO with a single output port (iRF) and 
one with no ports (ioRF). All designs are 32-bit wide with variable depth and run 
in self-oscillation mode. Results are reported in Figure 6 for latency, Figure 7 for 
throughput, Figure 8 for power demand and Figure 9 for area. All designs are based on 
a slow 0.8 µm, 5 V technology, so the absolute values are of little interest, although 
they remain useful for comparison purposes. 



Fig. 6. Latency of FIFO designs with respect to their depth 

Simulating FIFO queues in self-oscillating mode has implications that have to be 
taken into consideration when analyzing the obtained results. A self-oscillating 
ripple FIFO will reach a stable equilibrium 
corresponding to maximum throughput and minimum latency, viz. an equal number of 
tokens and bubbles. Therefore, in a realistic execution mode its performance would be 
worse than considered here, possibly with a lower power budget. A self-oscillating 
arbitrated FIFO instead reaches a stable equilibrium, which will never cause 
metastability in the semaphore. Therefore, the values in Figures 6, 7 and 8 represent 
the peak throughput, minimum latency and minimum power budget. 



Fig. 7. Throughput of FIFO designs with respect to their depth 

Fig. 8. Power of FIFO designs with respect to their depth 



Self-oscillating de-coupled FIFOs will generally reach a stable equilibrium with 
the output always starving for new data. In this case, throughput will be the minimum 
one and distributing the output port will not result in an appreciable improvement for 
the ioRF design. Therefore, in a different execution situation we expect the throughput 
of the de-coupled schemes to gain on the other two schemes, especially for the fully 
distributed ioRF scheme over the iRF one. Latency and power budget are not appreciably 
affected. 

Fig. 9. Area of FIFO designs with respect to their depth 

The de-coupled scheme proves more efficient in terms of latency and power 
budget, whilst the ripple one leads to higher (peak) throughput and smaller area. 
However, we expect the de-coupled scheme to prevail in generic execution, which can 
differ significantly from the self-oscillating mode. The arbitrated scheme is generally 
worse and its performance is expected to worsen once the metastability penalty is 
considered. 



7 Conclusions 

In this paper, we have explored the design space of asynchronous FIFO queues. 
Different solutions, among which two novel de-coupled schemes, have been analyzed 
and compared on the basis of their latency, throughput, power consumption and area. 
In this way, a complete picture of asynchronous FIFO design has been drawn. The 
novel de-coupled schemes prove a good choice when latency and power budget are at 
stake, although an appreciable area penalty is introduced. Throughput is lower than the 
peak throughput of a standard ripple scheme; however, when the fill and empty rates of 
the queue differ, the throughput gap is expected to narrow appreciably. 

Abstract. In this paper, several self-checking carry-propagate adders are 
examined and compared in terms of area integration, power dissipation and 
performance. Real-time detection of any single fault, permanent or transient, is 
ensured for all the presented circuits while the characteristics of each adder are 
illustrated. The results indicate that the characteristics of the adders change 
when safety mechanisms are applied. The constraints of the required 
system design also dictate the appropriate adder. 



1 Introduction 

Low power dissipation has become a critical issue in the VLSI design area, due to 
the wide spread of portable and wireless applications and their need for extended 
battery life. In particular, in the growing 8-bit market, low-voltage and low-power 
microcontrollers have made their appearance, challenging even dominant 8-bit 
conventional architectures such as 68HCxx, 8051/8031. 

Portable systems targeting the medical applications market require highly safe
operation (fail-safe systems) in addition to low power dissipation. Erroneous
functionality due to system failures is not acceptable, and on-line
detection and indication of errors is desirable. The real-time system constraints
must also be satisfied.

The design of highly reliable and safe systems requires additional
hardware and/or software overhead (safety mechanisms). The required safety levels of
the targeted application, which are derived from the international safety standards,
notably affect the overhead needed [2]. In general, the conventional approach to such
requirements employs either double-channelled (or multi-channelled) architectures
(e.g. two microcontrollers in parallel) that continuously compare their data, or
safety mechanisms that detect faulty operation in each functional unit of a
specific architecture.

The first approach leads to a significant increase in the hardware requirements and
the power dissipation of the system; it is therefore not recommended when low power
dissipation is of great importance. The other approach leads to low power dissipation
of the system when several low-power techniques are applied.

This second approach, which allows on-line error detection, is based on the use of
coding techniques to obtain redundant information able to detect transient and
permanent faults at a significantly lower cost than that of the first approach.
Thus, the class of self-checking circuits and systems has been created. A
self-checking circuit consists of a functional unit, which produces encoded output
vectors, and a TSC checker, which checks the vectors to determine whether an error
has occurred. The TSC checker is able to give an error indication even when a fault
occurs in the checker itself.

Although many safety mechanisms covering almost all possible functional
units have been presented [1], [2], [3], [4], [5], [6], [7], no attention has been
paid to their power dissipation or to implementations optimized for given
specifications. Though less hardware is required for such systems, nothing guarantees
that their power dissipation is minimized. In [2], the hardware and power
requirements of self-checking architectures for common data path and data storage
circuits were examined. The circuits examined were implemented for different coding
schemes using standard cell technology. The detection of any single fault, permanent
or transient, was ensured for all the proposed circuits, and the effectiveness of
each coding scheme in the detection of double and triple faults was also determined.

In [8], various implementations of adders are examined in terms of area, power
dissipation and performance, and the effect of each of these three major terms
is illustrated. Although the effect of these terms is well known, no attention was
given to creating safety mechanisms for these units, nor to exploring the
characteristics of the existing ones. In this paper, several fault-secure
carry-propagate adder implementations are studied in terms of area, power dissipation
and performance. The examined adders are the one proposed in [7], a slightly modified
version of that implementation, and the carry-complete full adder, which is adapted
to the safety requirements. The rest of this paper is organized as follows: in
Section 2, basic background is presented. In Section 3, the examined implementations
are described, and in Section 4, results from the comparison of these
implementations in terms of area, power dissipation and performance are presented.
Finally, in Section 5, several conclusions are offered.



2 Basic Properties of Self-Checking Circuits 

Self-checking circuits are used to ensure concurrent error detection for on-line
testing by means of hardware redundancy. All these circuits aim at the so-called
totally self-checking goal, i.e. the first erroneous output of the functional block
provokes an error indication on the checker outputs. To achieve this goal, checkers
have been defined to be Totally Self-Checking (TSC), and they have to be combined
with TSC or Strongly Fault Secure (SFS) functional circuits. The terminology for the
safety properties of a circuit is given below, so that terms such as the above can
be understood. Typical architectures to achieve the fault-secure property are also
provided in the second subsection.

The design of a secure circuit has many aspects, depending on the safety needs. The
safety of a circuit G is characterized with respect to a fault set F.

If, for circuit G, with respect to F, for each $f \in F$ there is an input vector,
applied during at least one of the circuit operation modes, that detects f, then
this circuit is called self-testing (ST). This safety property characterizes the
BIST (Built-In Self-Test) technique. According to this technique, a set of input
vectors is capable of detecting all the single errors present in a circuit. The
disadvantage of this technique is that it cannot be applied in safety-critical
applications, because testing is not performed in real time.

A circuit G is called fault-secure (FS) with respect to a fault set F if, for each
$f \in F$, the circuit never produces an incorrect output codeword for any input
codeword. The above two properties, self-testing and fault-secure, when
combined in a circuit, characterize it as totally self-checking (TSC). The meaning of
a codeword is strongly related to the code-disjoint property, which is
defined later on.

A circuit G is strongly fault-secure (SFS) if, with respect to F, for every fault in
F, either:

1) G is self-testing and fault-secure, or

2) G is fault-secure, and if another fault from F occurs in G, then, for the
resulting multiple fault, case 1 or 2 is true.

When a coding algorithm is used for the inputs and outputs of a circuit, and the
circuit always maps codeword inputs into codeword outputs and non-codeword
inputs into non-codeword outputs, the circuit is code-disjoint (CD). A circuit that
is both self-testing and code-disjoint is a TSC checker.

A circuit G is strongly code-disjoint (SCD) with respect to a fault set F if:

- before the occurrence of any fault, G is code-disjoint, and
- for every fault f in F, either:

1) G is self-testing, or

2) G under fault f always maps non-codeword inputs to non-codeword outputs,
and if a new fault in F occurs, then, for the resulting multiple fault, case 1 or 2
is true.

The above definitions are the properties of a circuit in terms of safe operation.
Throughout this paper, a hypothesis is also made about multiple errors in
fault-secure circuits: when an error is present in a circuit, a second one can
appear only after enough time has elapsed for the first one to have been detected.
This hypothesis may seem convenient, but it is fully realistic, since it is very
unlikely that two errors appear simultaneously. When an error appears, the circuit
should detect it in order to be self-testing.



3 Self-Checking Adders

To be characterized as self-checking, an adder must be constructed of two
basic blocks: the fault-secure functional block and the TSC checker block. In [7], a
study of parity-prediction arithmetic operators is presented. The authors illustrate
the basic idea of designing self-checking adders based on parity prediction by
proposing a self-checking ripple-carry adder. This adder, shown in Fig. 1, is one of
the three implementations explored in this paper. Its main characteristics
are a delay that is proportional to the bitwidth (n) of the input and a low
area.

Fig. 1. Self-checking ripple-carry full adder

A simple modification to the adder of Fig. 1, applied to the parity prediction logic
that calculates PCp(i), slightly reduces the glitch effect. Replacing the cascaded
XOR gates with a balanced tree of XOR gates, a basic low-power method for
minimizing the "extra" transitions and power in a design, balances all signal paths
and reduces the logic depth. Thus, spurious transitions due to finite propagation
delays are minimized.
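
The depth reduction can be illustrated with a small sketch. The Python snippet below is only an illustration and is not taken from the paper; the function names are hypothetical. It computes the parity of a list of bits once with a cascaded chain of 2-input XORs and once with a balanced tree, and reports the number of XOR levels on the longest path in each case.

```python
from functools import reduce

def cascade_parity(bits):
    """Parity via a linear chain of 2-input XORs; returns (parity, depth)."""
    parity = reduce(lambda x, y: x ^ y, bits)
    return parity, len(bits) - 1          # the last input passes one gate, the first n-1

def tree_parity(bits):
    """Parity via a balanced tree of 2-input XORs; returns (parity, depth)."""
    level, depth = list(bits), 0
    while len(level) > 1:
        # pair up neighbours; an odd leftover signal is forwarded unchanged
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth

bits = [1, 0, 1, 1, 0, 1, 0, 0]
print(cascade_parity(bits))   # same parity, depth 7
print(tree_parity(bits))      # same parity, depth 3
```

For n inputs the chain has depth n-1 while the tree has depth ceil(log2 n), so input arrival times at each gate are better matched in the tree, which is the property the text relies on to reduce spurious transitions.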

The third adder is the carry-complete adder, which is similar in structure to the
ripple-carry adder of [7]. The carry-complete full adder cell is illustrated in
Fig. 2. The main characteristics of this adder are a delay that is on average
proportional to log2 of the bitwidth (n) of the input, and low power dissipation
due to the balanced logic stages. Additional information and an in-depth analysis of
the carry-complete adder can be found in [9]. The proof of the self-checking
property for the carry-complete adder follows.

Following the design methodology of the fault-secure ripple-carry adder found in
[7], parity prediction is used for the carry-complete full adder as well. To
predict the parity of an adder, the well-known relationship
$PS = PA \oplus PB \oplus PC \oplus C_0$ is used. This relationship provides the
predicted parity of an addition, which must coincide with the parity of the produced
sum. Any single error on inputs A, B or $C_0$ inverts this signal, generating an
error indication from a TSC checker. Errors of the carry propagated along the stages
must also generate an error indication.
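
The parity relationship can be checked exhaustively for a small bitwidth. The following Python sketch is an illustration under stated assumptions rather than the authors' circuit: it assumes that PC denotes the parity of the internal carries (the carries entering bit positions 1 to n-1) and that the remaining term is the adder's carry-in. All function names are hypothetical.

```python
def parity(bits):
    """XOR of all bits in an iterable."""
    p = 0
    for x in bits:
        p ^= x
    return p

def ripple_add(a, b, c_in, n):
    """Bit-level ripple-carry addition.

    Returns (sum_bits, carries_in, carry_out), where carries_in[i] is the
    carry entering bit position i (so carries_in[0] == c_in).
    """
    sum_bits, carries_in, c = [], [], c_in
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carries_in.append(c)
        sum_bits.append(ai ^ bi ^ c)
        c = (ai & bi) | ((ai ^ bi) & c)
    return sum_bits, carries_in, c

def predicted_sum_parity(a, b, c_in, n):
    """PS = PA xor PB xor PC xor carry-in (PC: assumed parity of internal carries)."""
    _, carries_in, _ = ripple_add(a, b, c_in, n)
    pa = parity((a >> i) & 1 for i in range(n))
    pb = parity((b >> i) & 1 for i in range(n))
    pc = parity(carries_in[1:])
    return pa ^ pb ^ pc ^ c_in

# Exhaustive check for 4-bit operands
ok = all(
    predicted_sum_parity(a, b, c, 4) == parity(ripple_add(a, b, c, 4)[0])
    for a in range(16) for b in range(16) for c in (0, 1)
)
print(ok)
```

Because each sum bit is the XOR of its two operand bits and its incoming carry, flipping any single sum bit makes the actual parity disagree with the predicted one, which is what the checker observes.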

The carry-complete full adder is based on the ripple-carry adder, modified so as
to include the propagation-complete detection logic. The two carry signals are
given by the equations:

$C_i^1 = A_i B_i + (A_i \oplus B_i)\, C_{i-1}^1$
                                                                              (1)
$C_i^0 = \bar{A}_i \bar{B}_i + (A_i \oplus B_i)\, C_{i-1}^0$

Analysis of the above equations shows that $C_i^0$ is complementary to $C_i^1$. Note
that when both operand bits are equal (both 0 or both 1), the "carry/no-carry"
decision can be made without waiting for the incoming carry. This property of the
carry-complete adder makes the completion of the carry propagation faster, depending
on the inputs.
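
A minimal sketch of one stage of these equations is given below. It is an illustration only; it assumes that the pair (0, 0) stands for "incoming carry not yet valid" and that a valid pair is a codeword of the 1-out-of-2 (two-rail) code, neither of which is stated explicitly in the text. The names `carry_pair` and `is_codeword` are hypothetical.

```python
def carry_pair(a, b, c1_prev, c0_prev):
    """One stage of equations (1): returns (C1_i, C0_i)."""
    p = a ^ b                                  # propagate term
    c1 = (a & b) | (p & c1_prev)               # "carry" rail
    c0 = ((1 - a) & (1 - b)) | (p & c0_prev)   # "no-carry" rail
    return c1, c0

def is_codeword(pair):
    """Assumed two-rail code: exactly one of the two rails is high."""
    return pair[0] != pair[1]

for a in (0, 1):
    for b in (0, 1):
        # With a valid (complementary) incoming pair the outputs stay complementary.
        for prev in ((0, 1), (1, 0)):
            assert is_codeword(carry_pair(a, b, *prev))
        # With no incoming carry yet, the decision is made only when a == b.
        early = carry_pair(a, b, 0, 0)
        print(a, b, early, "decided early" if is_codeword(early) else "waiting")
```

Under the same assumption, flipping a single rail of a valid pair always yields (0, 0) or (1, 1), i.e. a non-codeword, which is the situation addressed by Lemma 1.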

Lemma 1. If any carry-out pair is fed to a TSC checker, any single error on the
carry-out signals produces an error indication.

Proof. By the defining property of the TSC checker, any non-codeword input
produces a non-codeword output.

Lemma 2. If a single error occurs at the output of $(A_i \oplus B_i)$, or
$A_i B_i$, or $\bar{A}_i \bar{B}_i$, then the output is safe.

Proof. If the output of the XOR changes, then either an erroneous sum is calculated
and the correct carry is propagated, so that a single error occurs in PS, or the two
carry-outs are assigned the same value, which, by Lemma 1, produces an error
indication.

The above statements prove that the carry-complete full adder, when combined
with an n-variable TSC checker to check the carry-out signals, and with the parity
prediction mechanism, is self-checking.



4 Experimental Results 

Implementations of the ripple-carry adder and of the double-channelled architecture
are also considered in this paper, as references for the area, power and
performance of the non-safe adder and of the commonly used fault-secure adder,
respectively. The rest of the implementations are the fault-secure ripple-carry full
adder presented in [7], a modified version of this adder that reduces the glitch
effect, and the proposed fault-secure carry-complete full adder. All adders are
implemented for operand bitwidths of 4, 8, 16, 32 and 64.

The first characteristic of the adders that is examined is the required area.
Measurements for several technologies were taken using the Mentor Graphics DVE tool,
and their average values were then normalized to the non-safe ripple-carry adder.
The results are illustrated in Fig. 3.

The modified adder of [7] is not shown in Fig. 3, because its area requirements are
the same as those of the ripple-carry adder of [7].

The higher area requirements of the carry-complete full adder were expected, due to
the significant area occupied by the n-variable TSC checker. The double-channelled
architecture also shows a factor of 2.9 compared to the non-safe ripple-carry adder,
due to the contiguous size of the adder and the n-variable TSC checker.

In [10], a methodology for estimating the power dissipation of circuits using logic
simulators and synthesizers is presented. The experimental results concerning the
power dissipation of the examined implementations are illustrated in Fig. 4. The
power dissipation of the double-channelled architecture exceeds a factor of 2
because of the power dissipated in the n-variable TSC checker. The experimental
results for the double-channelled architecture concerning area requirements and
power dissipation are not expedient for real applications. The multi-channelled
technique is applied to the outputs of the whole system; thus a system's power and
area are increased by a factor of 2.

Fig. 4. Power dissipation of the adders

The last experimental results concern the performance of the implemented
adders. It is useful to recall the delays of the ripple-carry adder and
the carry-complete adder. The ripple-carry full adder presents a delay of
$(2n-1)\tau$. In contrast, the delay of the carry-complete full adder is still
proportional to n in the worst case, but the best and average cases are improved
considerably, the former being constant and the latter being proportional to
$\log_2 n$.

Fig. 5. Performance of the adders

The adders are examined in terms of performance, and the results are illustrated in
Fig. 5. Note that the results imply that an extra delay is added by the TSC
mechanisms at the end of the adders' stages. Although the influence of these
mechanisms is not large, it must be considered when designing fault-secure
systems targeting performance-critical applications.
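
The logarithmic average-case behaviour quoted for the carry-complete adder can be made plausible with a short simulation. The sketch below is not the paper's delay model: it uses the longest run of consecutive propagate positions (bits where the operands differ) as a proxy for the carry-completion time, and it assumes uniformly random operands. Function names and the number of trials are illustrative choices.

```python
import random
from math import log2

def longest_propagate_run(a, b, n):
    """Longest run of consecutive bit positions i with a_i XOR b_i = 1."""
    p = a ^ b
    best = run = 0
    for i in range(n):
        if (p >> i) & 1:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best

def average_run(n, trials=2000, seed=0):
    """Monte Carlo estimate of the mean longest propagate run for n-bit operands."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += longest_propagate_run(rng.getrandbits(n), rng.getrandbits(n), n)
    return total / trials

for n in (4, 8, 16, 32, 64):
    print(n, round(average_run(n), 2), "log2(n) =", log2(n))
```

The worst case (for example all-ones plus zero) still yields a run of n, while identical operands yield a run of 0, which mirrors the linear worst case and constant best case described in the text.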



5 Conclusions 

In this paper, experimental results in terms of area, power and performance have
been presented for the implementations of several self-checking adders. A
fault-secure carry-complete full adder was proposed and compared to the other
implementations. It was shown that the carry-complete full adder is advisable for
power- and performance-critical applications, but not when the area overhead must
remain low.

Abstract. Usually, self-timed modules for asynchronous system design are
realized by means of dynamic logic circuits. Moreover, in order to easily detect
operation completion, dual-rail encoding is preferred. Therefore, dynamic
differential logic circuits (such as Differential Cascode Voltage Switch Logic
(DCVSL)) are widely used, because they intrinsically produce both true and
inverted values of the output. However, the use of dynamic logic circuits
presents two main difficulties: i) design and testing are more complex; ii) it is
often not possible to use a standard design methodology. This paper presents a new
static logic VLSI implementation of a high-speed self-timed adder based on the
statistical carry look-ahead addition technique. A 56-bit adder designed in this
way has been realized using 0.6 µm AMS Standard Cells. It requires about
0.6 mm² of silicon area, has an average addition time of about 4 ns, and dissipates
only 20.5 mW in the worst case.



1 Introduction 

Self-timed systems are often attractive because they can operate at average-case
rather than worst-case speed, reduce power consumption, and avoid long clock
connections [1,2]. They use variable-time computational elements that run only when
a request and a data word arrive. However, designing a self-timed system is not a
straightforward task, because events must be logically ordered, avoiding races and
hazards, by means of an appropriate handshaking protocol. A handshaking circuit is
used to guarantee that the computational elements have stable data inputs during
the evaluation phase and to allow overlapping of alternating
initialization-evaluation phases in adjacent computational elements.

In many applications, the computational elements are self-timed adders. Efficient
variable-time adders have been widely studied [3,4,5]. Typically, they are realized
using CMOS dynamic logic circuitry (e.g. Domino, DCVS). Such circuits are faster and
occupy a smaller area than traditional static circuits. However,
they are very sensitive to noise and to circuit and layout topology. Moreover, they
suffer from charge leakage, charge sharing and crosstalk. Therefore, their use in an
asynchronous self-timed system greatly increases the effort required to verify the
functionality and the performance of the whole system.

In this paper, we demonstrate that it is possible to realize a high-speed self-timed
adder using conventional VLSI design methodologies and static logic cells. In this
way, all the above problems are removed and designers can redirect their effort
toward system-level verification of the asynchronous design.

The proposed adder is based on the statistical carry look-ahead addition (SCLA)
technique, which was recently introduced as a new way to realize efficient N-bit
self-timed adders whose average delay is much lower than $\log_2(N)$ [6]. One of the
peculiarities of this technique is that it does not require dual-rail signaling in
order to detect operation completion. A 56-bit adder implemented in this way
achieves an average addition time of about 4 ns while consuming less than 50% of the
power dissipated by conventional dynamic logic designs [3].



2 Brief Background on Statistical Carry Look-Ahead Addition



The statistical carry look-ahead addition technique allows a self-timed addition
between two N-bit operands A and B to be performed using end-completion sensing
radix-b full adders as basic elements. With $b = 2^M$, the adder consists of
$n = \lceil N/M \rceil$ M-bit end-completion sensing radix-b full adders. Let us
suppose that the latter compute propagate terms $p_i$ for each bit position i such
that $p_i = A_i \oplus B_i$ (i = 0 ... M-1). Each radix-b full adder can perform its
sum operation either waiting for the valid carry-in or without waiting for it. These
events can be identified by computing the term

$P_{NW} = \overline{p_0\, p_1\, p_2 \cdots p_{M-1}}$

which is high if the radix-b full adder can proceed without waiting for an incoming
carry-in; otherwise it is low.

It is easy to understand that, supposing a uniform distribution of the operands, for
a non-least significant radix-b full adder the probability of having $P_{NW} = 1$ is
$pr = (b-1)/b$.

Note that the carry-out of the least significant radix-b full adder is always known
after the presentation of the operands and carry-in. Due to this, the probability of
computing its carry-out bit without waiting for the valid carry-in is equal to 0.

Therefore, in an N-bit adder the $P_{NW}$ signals can be computed for all non-least
significant radix-b full adders, and their composition can be used to represent an
(n-1)-bit binary number j.

The probability of a given configuration of j is

$p_j = pr^{u(j)} \cdot (1 - pr)^{\,n - u(j) - 1}$                             (1)

where u(j) is the number of 1s in j.

As demonstrated in [6], the average number of cascaded radix-b full
adders waiting for a carry is given by (2)

$AVG_{FA} = 1 + \sum_{j=0}^{2^{n-1}-1} z(j)\, p_j$                            (2)

where z(j) is the length of the longest string of 0s in j, and the 1 at the
beginning is due to the fact that at least one radix-b full adder (theoretically the
least significant one) is waiting for the carry-in. From numerical calculation of
(2), it can be concluded that an adder implemented using the above principle will
show an average delay much lower than $\log_2(N)$ [6].
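
The numerical calculation of (2) can be sketched by direct enumeration. The Python snippet below is an illustration, not the authors' routine; it assumes the sum runs over all (n-1)-bit values of j and that each bit of j equals 1 with probability pr = (b-1)/b. For N = 56 and M = 4 the summation evaluates to approximately 0.6141, which agrees with the coefficient 0.614123 that appears in equation (4). Function names are hypothetical.

```python
def longest_zero_run(j, width):
    """Length of the longest string of 0s in the width-bit number j, i.e. z(j)."""
    best = run = 0
    for k in range(width):
        if (j >> k) & 1:
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best

def expected_longest_wait(N, M):
    """Sum over j of z(j) * p_j for an N-bit adder built from M-bit blocks."""
    b = 2 ** M
    n = -(-N // M)                 # ceil(N / M) radix-b full adders
    pr = (b - 1) / b               # probability that a block need not wait
    width = n - 1                  # one bit of j per non-least significant block
    total = 0.0
    for j in range(2 ** width):
        u = bin(j).count("1")      # u(j): number of 1s in j
        p_j = pr ** u * (1 - pr) ** (width - u)
        total += longest_zero_run(j, width) * p_j
    return total

extra = expected_longest_wait(56, 4)
print(round(extra, 6), round(1 + extra, 6))   # sum term and AVG of equation (2)
```

The enumeration visits only 2^13 = 8192 configurations for the 56-bit case, so no closed form is needed at this size.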

The average time needed for the adder to compute the carry-out of all radix-b full
adders ($\tau_{CARRY}$) is given by (3), where $\tau_{LSFA}$ and $\tau_{MSFA}$ are
the average delays of the least significant and of a non-least significant radix-b
full adder, respectively. Since, as shown in Section 4,
$\tau_{LSFA} < \tau_{MSFA}$, (3) has to be modified as (4), indicating that at
least one of the more significant radix-b full adders contributes to the average
delay.

$\tau_{CARRY} = \tau_{LSFA} + \left( \sum_{j=0}^{2^{n-1}-1} z(j)\, p_j \right) \cdot \tau_{MSFA}$      (3)

$\tau_{CARRY} = \tau_{MSFA} + 0.614123 \cdot \tau_{MSFA}$                     (4)

It is worth pointing out that $\tau_{CARRY}$ does not take into account the average
time required to compute the sum bits of a radix-b full adder when an incoming carry
ripples into it for some bit positions after the computation of its carry-out.

In the previous dynamic implementations of adders based on the SCLA technique
[6,7], this amount of time was considered to be constant [6]. Therefore, a further
fixed delay was added to (3). However, in the implementations shown in [6,7], the
time needed to generate the end-completion signal is large enough to assume
that all sum bits have been calculated within it.

In this paper, for the first time, the additional delay for computing the sum bits
is fully taken into account, including its variability. In fact, the additional
delay is not constant, because it depends on how many bit positions an incoming
carry ripples through.

Let us suppose that M=4, that is, each M-bit block represents a radix-16 full adder.
Let us also suppose that the i-th radix-16 full adder generates the carry-out
completion early because $p_3 = 0$. If the (i-1)-th radix-16 full adder propagates a
high carry, this carry will ripple through three bit positions in the i-th radix-16
full adder. Thus, the sum bits change after the i-th full adder has flagged the
validity of the carry-out. It can be easily verified that the probability of the
above event occurring is 16/256. Moreover, similar events can occur for the cases in
which the i-th radix-16 full adder has $p_3 = 1$ and $p_2 = 0$, or
$p_3 = p_2 = 1$ and $p_1 = 0$, or $p_3 = p_2 = p_1 = 1$ and $p_0 = 0$.
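
The 16/256 figure can be reproduced by enumerating all pairs of 4-bit operand slices. The sketch below is an illustration under an assumption about how the event is defined: the most significant propagate term of the block is 0 (so the carry-out is known early) while the three lower propagate terms are all 1 (so an incoming carry ripples through three bit positions). Function names are hypothetical.

```python
from fractions import Fraction

def propagate_pattern(a, b, M=4):
    """Propagate terms p_i = a_i XOR b_i for i = 0 .. M-1."""
    return [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(M)]

def prob_three_bit_ripple(M=4):
    """Fraction of operand pairs with p_(M-1) = 0 and all lower p_i = 1."""
    hits = 0
    for a in range(2 ** M):
        for b in range(2 ** M):
            p = propagate_pattern(a, b, M)
            if p[M - 1] == 0 and all(p[:M - 1]):
                hits += 1
    return hits, Fraction(hits, 4 ** M)

hits, prob = prob_three_bit_ripple()
print(hits, "of", 4 ** 4, "pairs ->", prob)
```

Each of the 16 possible propagate patterns is produced by exactly 16 of the 256 operand pairs, so any single pattern has probability 16/256 under uniformly distributed operands.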

A software routine has been built to compute the probability of occurrence of all
the above cases. Then, taking into account that rippling through at least one bit
position will occur, the actual average addition time is

$\tau_{AVG} = \tau_{CARRY} + \tau_{rip1} + \frac{24}{256}(\tau_{rip2} - \tau_{rip1}) + \frac{16}{256}(\tau_{rip3} - \tau_{rip1}) + \tau_{GEND}$      (5)

where $\tau_{ripK}$ is the time required for rippling through K bit positions (i.e.
$\tau_{rip0} = 0$ ns) and $\tau_{GEND}$ is a fixed time needed to obtain the
end-completion signal.



3 The Proposed Implementation 

We have investigated the possibility of efficiently implementing a 56-bit self-timed
adder based on the SCLA technique using AMS 0.6 µm Standard Cells [8]. A completely
new architecture has been designed for this purpose. In accordance with [6,7], M=4
has been chosen.




Fig. 1. Top-level architecture of the implemented 56-bit adder 

Lowering the START signal starts adder activity. In the following we will suppose
that the handshaking modules, which are not detailed here, ensure that this event
happens as soon as operands appear on the input lines. (Note that lowering START
at the same time as the operands arrive is the worst condition.)

Fig. 2. 1-bit full adder circuit. Its carry-out signal (COUT) is low initialized




Fig. 3. Schematic diagram of the least significant variable-time radix- 16 full adder 

Fig. 2 shows the 1-bit full adder circuit with low initialization phase. Note
that the 1-bit full adder is in its initialization phase (i.e. its carry-out is low
independently of the operands) when the signal ISTART is high. As soon as the ISTART
signal is lowered, the 1-bit full adder is able to compute its carry-out and sum
bits. The above scheme is used in both the least significant and non-least
significant radix-16 full adders, shown in Fig. 3 and Fig. 4, respectively. It can
be seen that, to accommodate loads, some logic gates are either duplicated or
strengthened.

Two appropriate rippling chains are used to form the END_CARRY and END_SUM
signals, which flag the validity of the carry-out and sum bits, respectively. Both
chains are high initialized and are realized by means of a suitable number of AND-OR
stages. Referring to the END_CARRY chain, its output is lowered (after ISTART
becomes low) with a delay that depends on which $p_i$ signal is low. Thus, if only
$p_3 = 0$, END_CARRY is delayed by just one AND-OR stage. On the contrary, if only
$p_1 = 0$, END_CARRY is delayed by three AND-OR stages. Exhaustive post-layout
simulations have demonstrated that, since the propagation delay of this AND-OR
stage is slightly greater than that of the 4:1 multiplexer used in the 1-bit full
adder circuits, carry completion is always correctly flagged.




Fig. 4. Schematic diagram of the non-least significant variable-time radix- 16 full adder 

The operation of the rippling chain used to signal the validity of the sum bits is
analogous. There, the propagate signals influence the generation of the END_SUM
signal in the opposite manner. The END_SUM signals (together with the
END_CARRY of the most significant radix-16 full adder) are used to determine the
completion of the whole operation. Thus, the production of the END_SUM signals is
brought forward to partially compensate for the delay introduced by the NOR-NAND
logic gates shown in Fig. 1. This has been done by giving $p_0$ and $p_1$ the same
weight (i.e. reducing the maximum rippling path of the END_SUM signal from 4 to 3
AND-OR stages).

The delay introduced by the NOR-NAND logic gates shown in Fig. 1 can in practice be
considered constant and corresponds to the above-mentioned $\tau_{GEND}$.

To analyze the operation of the adder, let us suppose that at time $t_0$ valid
operands appear and the START signal is lowered. After the delay introduced by a XOR
gate ($\tau_{XOR}$) all $p_i$ signals are valid, and after a further delay due to a
4-input NAND ($\tau_{NAND4}$) all $P_{NW}$ signals are determined.

Observing Fig. 3, it can be seen that the ISTART signal is delayed with respect to
START by means of a XOR gate. Therefore, ISTART falls when all the $p_i$ signals are
valid. Then, the 1-bit full adders start to compute the sum and carry bits. The
chains computing the validity of the carry-out (END_CARRY) and of the sum bits
(END_SUM) start rippling. The END_CARRY signal is lowered when the carry-out bit is
valid.

In the meantime, each radix-16 full adder in a more significant position is able to
determine whether or not its carry-out can be computed independently of the
carry-in. In the former case, the i-th radix-16 full adder starts the evaluation of
its carries at the time $t_0 + \tau_{XOR} + \tau_{MUX2}$, where $\tau_{MUX2}$ is the
delay of the 2:1 multiplexer depicted in Fig. 4. In the opposite case, the i-th
radix-16 full adder is left in its initialization phase until a valid
carry-in arrives. The (i-1)-th radix-16 full adder will flag the validity of the
carry-out bit by lowering the END_CARRY_(i-1) signal. Thus, the latter signal will
be selected by the above-mentioned 2:1 multiplexer to start the evaluation of the
i-th radix-16 full adder.
It is worth pointing out that, since $\tau_{XOR} + \tau_{MUX2}$ is slightly greater
than $\tau_{XOR} + \tau_{NAND4}$, glitches are avoided on the 2:1 multiplexer output.

When all END_SUM_i and END_CARRY_55 become high, GEND rises, signaling
operation completion. Then, the circuit is re-initialized by a rising edge of the
START signal.




Fig. 5. Layout of the proposed 56-bit adder (dimensions marked on the layout: 780 µm and 774 µm)



The completion of the initialization phase is signaled by the subsequent falling
edge of the GEND signal. As shown in Fig. 1, a NAND gate is used instead of a typical
Muller C-element to generate the GEND signal. This choice allows partial
overlapping between the initialization phase of the adder and handshake signaling.

4 Results

The circuit described above has been realized using the Austria Mikro Systeme (AMS)
p-sub, 2-metal, 1-poly, 5 V, 0.6 µm CMOS process [8]. The layout of the 56-bit adder
for testing purposes is organized in 11 standard-cell rows and is shown in Fig. 5.
It requires about 780 µm × 780 µm of silicon area.

Digital and transistor-level (using BSIM 3v3 device models) simulations have been
performed. In order to calculate the average addition times of the least significant
and non-least significant radix-16 full adders, exhaustive simulations of both have
been carried out using worst-case delay models. From these simulations,
$\tau_{LSFA} = 1.54$ ns, $\tau_{MSFA} = 2.04$ ns, $\tau_{rip3} = 2$ ns,
$\tau_{rip2} = 1.46$ ns, $\tau_{rip1} = 0.8$ ns and $\tau_{GEND} = 0.12$ ns were
measured. Thus, using (5), an average addition time of about 4.3 ns is obtained. In
order to confirm the theoretical results, the 56-bit adder was also simulated with a
large number of random operands.
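
As a cross-check of this arithmetic, the measured delays can be substituted into (4) and (5). The sketch below rests on two assumptions about the source equations, which are partly illegible: that both terms of (4) are weighted by the delay of a non-least significant radix-16 full adder (2.04 ns), and that the term multiplied by 24/256 in (5) is the difference between the 2-bit and 1-bit rippling times. Under these assumptions the result is approximately 4.35 ns, in line with the quoted value of about 4.3 ns. The function name is hypothetical.

```python
def average_addition_time(t_msfa, t_rip1, t_rip2, t_rip3, t_gend, k=0.614123):
    """Evaluate equations (4) and (5) as read here (all times in ns)."""
    # Equation (4): carry time, assuming both terms use the MSFA delay
    t_carry = t_msfa + k * t_msfa
    # Equation (5): add the sum-bit rippling contributions and the GEND delay
    return (t_carry
            + t_rip1
            + 24 / 256 * (t_rip2 - t_rip1)   # assumed form of the elided term
            + 16 / 256 * (t_rip3 - t_rip1)
            + t_gend)

t_avg = average_addition_time(t_msfa=2.04, t_rip1=0.8, t_rip2=1.46,
                              t_rip3=2.0, t_gend=0.12)
print(round(t_avg, 2))
```

With the other plausible readings of (4) the total would come out near 3.8 ns or 4.0 ns, so the reading used here is the one closest to the value reported in Table 1.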




Fig. 6. Energy dissipation and supply current during the 0+0+0 addition. The START
signal falls at 1 ns and rises after operation completion, re-initializing the circuit

In accordance with [3], power dissipation measurements have been performed in two
specific cases: a) without carry propagation (minimum value), b) with the longest
carry propagation path (maximum value).

At 1 ns, valid operands and the START falling edge (fall time 200 ps) are
simultaneously applied to the input lines. After operation completion the START
signal rises, re-initializing the circuit. This action corresponds to the precharge
phase of dynamic circuitry, and for the 56-bit adder it requires about 1.7 ns. For
re-initializing the circuit, power dissipations of about 5 mW and 9.5 mW were
measured after the operations of cases a) and b), respectively, were performed at
10 MHz.

Fig. 7. Energy dissipation and supply current during the 0+FFFFFFFFFFFFFF+1
addition. The START signal falls at 1 ns and rises after operation completion,
re-initializing the circuit

Table 1 summarizes the performance comparison between the proposed adder and recent
efficient adders realized using dynamic logic gates. All data reported
in Table 1 refer to laid-out designs; thus, interconnection parasitics were taken
into account. Since the adders described in [3] were realized using a 1.0 µm
CMOS process, scaled values of the average addition time were added to Table 1.

Table 1. Performance comparison between the new adder and previously proposed adders.
*Data reported in [3] is related to 32-bit adders; their average addition time for
the 0.6 µm process was estimated by means of delay relationship. 5 V supply voltage
was used for all designs

Type of adder      Area [µm²]     Power [mW]    Avg addition time [ns]   Avg addition time [ns]
                                  min/max       Process 0.6 µm           Process 1.0 µm
                                  @ 10 MHz
RC in [3]*         274 × 2430     39.9/41.8     4.6                      10
CLA in [3]*        304 × 2567     45.5/49.3     5.8                      12.5
BCL in [3]*        1020 × 2265    74.1/79.3     4.2                      9
New 56-bit adder   780 × 780      15.0/20.5     4.3                      9.2

From these results it can be concluded that the new adder achieves very high speed
with very low power dissipation. The scaled average addition times, which can be
used only as a rough indicator, allow us to claim that if the adders described in
[3] were designed using the 0.6 µm CMOS process, their speed would still be lower
than that shown by the proposed 56-bit adder. Moreover, their power dissipation is
expected to remain greater than that required by our new Standard Cells design.
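
The figures in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below is an illustration only: it transcribes the table values as read here into a hypothetical `TABLE_1` dictionary and computes the silicon area in mm² and the ratio of the new adder's worst-case power to that of each reference design.

```python
# Table 1 values as read here: area as (width, height) in um, power as (min, max) in mW
TABLE_1 = {
    "RC in [3]":        {"area_um": (274, 2430),  "power_mw": (39.9, 41.8)},
    "CLA in [3]":       {"area_um": (304, 2567),  "power_mw": (45.5, 49.3)},
    "BCL in [3]":       {"area_um": (1020, 2265), "power_mw": (74.1, 79.3)},
    "New 56-bit adder": {"area_um": (780, 780),   "power_mw": (15.0, 20.5)},
}

def area_mm2(entry):
    """Silicon area in mm^2 from the two dimensions given in um."""
    w, h = entry["area_um"]
    return w * h / 1e6

def max_power_ratio(new, ref):
    """Worst-case power of the new design relative to a reference design."""
    return new["power_mw"][1] / ref["power_mw"][1]

new = TABLE_1["New 56-bit adder"]
print("area of new adder [mm^2]:", round(area_mm2(new), 3))
for name, entry in TABLE_1.items():
    if entry is not new:
        print(name, "power ratio:", round(max_power_ratio(new, entry), 2))
```

With these values the new adder occupies about 0.61 mm², and its worst-case power is below half that of each design from [3], consistent with the figures quoted in the abstract and introduction. Note that the designs from [3] are 32-bit adders, so the comparison is indicative only.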



5 Conclusion 

A new high-speed 56-bit adder for self-timed designs was presented. It has been
realized using static logic 0.6 µm CMOS Standard Cells. The SCLA technique used
allows a high-speed, low-cost, low-power circuit to be obtained. The proposed
architecture can easily migrate to the newest low-voltage CMOS processes, and
further advantages are expected.

The history of chipcards began in the 1980s. The first chips
consisted of a non-volatile memory (NVM), a serial I/O channel and a finite state
machine offering the security needed to allow only secure access to the stored
data. The memory size was in the range of some 10 to 100 bytes, the clock rate
around 200 kHz and the power consumption more than 100 mA at 5 V.

Today, the chipcard market has grown into a billion-euro business. Powerful
dedicated 32-bit security controllers like the 88-Core of Infineon Technologies
are ready to revolutionize chipcard-based solutions. The 88-Core is a
straightforward RISC architecture with caches for data and instructions. In addition,
for the first time in the chipcard world, it offers virtual memory with an efficient
translation look-aside buffer, which enables optimal organisational security. To also
achieve a quantum leap in physical security, several independent mechanisms are
implemented, for example hard encryption of memories.

Chipcards based on this architecture, like the 88-family, support the 88-Core
with ROM and NVM of up to 256 Kbytes each, as well as up to 16 Kbytes of
RAM, a variety of powerful coprocessors like a DES accelerator and the
Advanced Crypto Engine (ACE), and of course a set of peripherals. The internal
clock rate is up to 66 MHz, but nevertheless these chipcards are able to operate
in a proximity contactless environment specified by ISO 14443, i.e. the distance is
less than 10 cm, but no battery is available and the transmitted power is much
less than 10 mW. How is this possible?

The solution is not one great invention but the smart combination of a few
techniques. The base is a leading-edge quarter-micron technology. Unfortunately,
for cost reasons it has to be a standard process, but it is adjusted for the best
tradeoff between performance and lowest power consumption. Next, the chips
are developed with an unconventional design methodology. It enables the flexible
integration of hard macros in a VHDL design. Of course, the hard macros are
described in a dedicated high-level language, too. Design parts having a certain
regularity and a relatively high switching frequency are selected to become hard
macros. But what is the advantage of these hard macros? They are designed in
a switching-current-free design style called dual rail logic with precharge. This
design style reduces power consumption dramatically and, if well designed, does
not increase transistor count.

Last but not least, a revolutionary power balancing technique is introduced.
The internal voltage regulator no longer uses a shunt transistor to balance
the voltage but the power-consuming circuit itself. The internal clock rate is the
control value. Thus, the chip consumes only the power that is transmitted,
wasting nothing. Using all these techniques, Infineon Technologies is able to
fulfill the hardest requirements for contactless applications.
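
The idea of using the clock rate as the control value can be illustrated with a minimal sketch. It assumes the standard first-order dynamic power relation P = C_eff * Vdd^2 * f; the function name, the effective capacitance and every numeric value are hypothetical and are not taken from the talk, and a real regulator would be a mixed-signal control loop rather than a one-line formula.

```python
def balanced_clock_hz(p_available_w, c_eff_f, vdd_v, f_max_hz):
    """Clock rate at which dynamic power C_eff * Vdd^2 * f equals the available power.

    All parameters are illustrative assumptions:
    p_available_w -- power received from the contactless field, in watts
    c_eff_f       -- effective switched capacitance per cycle, in farads
    vdd_v         -- internal supply voltage, in volts
    f_max_hz      -- maximum clock rate the chip supports, in hertz
    """
    f_balanced = p_available_w / (c_eff_f * vdd_v ** 2)
    # Never exceed the maximum clock rate, even if more power is available.
    return min(f_balanced, f_max_hz)


# Hypothetical operating points: a weak field and a strong field.
weak_field = balanced_clock_hz(0.004, 1e-10, 2.0, 66e6)
strong_field = balanced_clock_hz(0.1, 1e-10, 2.0, 66e6)
```

In this simplified model the chip slows down when the field weakens instead of burning surplus power in a shunt element.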




Dynamic Memory Design 
for Low Data-Retention Power 



Joohee Kim and Marios C. Papaefthymiou 



Advanced Computer Architecture Laboratory 
Department of Electrical Engineering and Computer Science 
University of Michigan, Ann Arbor, MI 48109 
{jooheek, marios}@eecs.umich.edu



Abstract. The emergence of data-intensive applications in mobile en- 
vironments has resulted in portable electronic systems with increasingly 
large dynamic memories. The typical operating pattern exhibited by 
these applications is a relatively short burst of operations followed by 
longer periods of standby. Due to their periodic refresh requirements, 
dynamic memories consume substantial power even during standby and 
thus have a significant impact on battery lifetime. 

In this paper we investigate a methodology for designing dynamic mem- 
ory with low data-retention power. Our approach relies on the fact that 
the refresh period of a memory array is dictated by only a few, worst-case 
leaky cells. In our scheme, multiple refresh periods are used to reduce 
energy dissipation by selectively refreshing only the cells that are about 
to lose their stored values. Additional energy savings are achieved by 
using error-correction to restore corrupted cell values and thus allow for 
extended refresh periods. We describe an exact $O(n^{k-1})$-time algorithm
that, given a memory array with n refresh blocks and two positive
integers k and l, computes k refresh periods that maximize the average
refresh period of a memory array when refreshing occurs in blocks of l
cells. In simulations with 16Mb memory arrays and a (72,64) modified
Hamming single-error correction code, our scheme results in an average 
refresh period of up to 11 times longer than the original refresh period. 



1 Introduction 

Mobility imposes severe constraints on the design of portable electronic systems, 
particularly with respect to their power dissipation [1]. A popular approach to 
minimizing power consumption in portable devices is to employ a standby mode 
in which almost all modules are powered down. Large-density dynamic random 
access memory (DRAM) dissipates energy even during standby, however, due 
to its periodic refresh requirement. Such dissipation is of particular concern 
in the case of data-intensive applications, due to their large dynamic memory 
requirements. 

The charge stored in dynamic memory cells must be periodically refreshed
to counter the corrupting effects of leakage currents. Due to local process
perturbations, each cell has different leakage currents, resulting in a
distribution of data-retention times $t_{RET}$ similar to the one shown in Figure 1.
Conventional DRAMs use a single periodic refresh signal to restore the charge
level in each cell capacitor to its original value. To prevent errors, refreshing
must be done at the minimum refresh period $t_{REF}$. This simple approach
inevitably dissipates more power than necessary. First, since $t_{REF}$ is set with
respect to the few “bad” cells, most memory cells are refreshed too early, thus
dissipating unnecessary power. Second, due to its strong dependency on the
leakage current, $t_{REF}$ is determined at the highest operating temperature,
resulting in unnecessary dissipation at lower operating temperatures.

Fig. 1. Distribution of data-retention time for DRAM cells. Data adapted from [5].

In this paper, we investigate the use of multiple refresh periods to eliminate
the power associated with refreshing good cells too often. We also explore the
use of error correcting codes (ECC) to further extend the average refresh period
$t_{REF}$. We give an exact $O(n^{k-1})$-time algorithm for computing an optimal set
of refresh periods for a memory array with n refresh blocks. Specifically, given
positive integers k and l, our algorithm computes k refresh periods that maximize
the average refresh period of the memory array, when memory is refreshed in
blocks of l cells. The addition of ECC makes it possible to further increase the
average refresh period by correcting the errors occurring during the extended refresh.
In simulations of a 16Mb memory array with a Single Error Correcting Code
(SEC), our proposed multirate refresh scheme results in an 11-fold increase of the
average refresh period with respect to a conventional single-period refresh scheme
without ECC.

The remainder of this paper has six sections. Section 2 gives an overview 
of leakage-current induced errors and refreshing in DRAMs. Error correcting 
codes are briefly introduced in Section 3. The proposed multirate ECC-enhanced 
refresh scheme is described in Section 4. Our algorithm for the optimal selection 
of k refresh periods is described in Section 5. Section 6 presents simulation 
results from the application of our methodology to a 16Mb DRAM array. We 
conclude our paper in Section 7 with a brief discussion of future work. 

2 DRAM Refresh

Conventional single-transistor DRAM cells are composed of one transistor and 
a capacitor. Due to its simplicity, this structure can be used to fabricate high- 
density memories. Unlike static random access memory (SRAM), however, the 
stored charge is not retained by a continuous feedback mechanism, and leakage 
current reduces the stored voltage level over time. There are many known leakage 
paths in a DRAM cell. The junction leakage current from the storage node, 
which increases exponentially with the operation temperature, is known to be 
the major leakage mechanism [5]. 

Leakage current can be expressed using the simple empirical formula

$$I_{leak} = A \exp\left(-\frac{E_a}{kT}\right) \qquad (1)$$

where $E_a$ is the activation energy, k is the Boltzmann constant, T is the
operating temperature, and A is a constant factor [5]. From this equation it follows
that the leakage current is a function of the activation energy, which depends on
fabrication processes such as ion implantation [6]. Due to local process
fluctuations, activation energies vary among cells [7,8]. A study has shown that
log($t_{RET}$) of the cells follows a bimodal distribution. The large main distribution
is composed of good cells, and a small tail distribution is composed of bad cells
[5]. To restore their intended voltage levels, DRAM cells need to be periodically
refreshed at a period not exceeding their minimum data-retention time $t_{RET}$.
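
To give a feeling for the temperature dependence, the following sketch evaluates the ratio of leakage currents at two temperatures, assuming the usual Arrhenius form I = A * exp(-Ea / (k * T)) for the empirical formula, so that the constant A cancels. The activation energy of 0.6 eV and the two temperatures are hypothetical illustration values, not data from the paper.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def leakage_ratio(ea_ev, t_hot_k, t_cold_k):
    """Return I(t_hot) / I(t_cold) for I = A * exp(-Ea / (k * T)).

    The prefactor A cancels in the ratio, so only Ea and the two
    absolute temperatures (in kelvin) are needed.
    """
    exponent_hot = -ea_ev / (BOLTZMANN_EV_PER_K * t_hot_k)
    exponent_cold = -ea_ev / (BOLTZMANN_EV_PER_K * t_cold_k)
    return math.exp(exponent_hot - exponent_cold)


# Hypothetical example: Ea = 0.6 eV, 85 degrees C versus 25 degrees C.
ratio = leakage_ratio(0.6, 358.15, 298.15)
print(f"leakage grows by a factor of roughly {ratio:.0f}")
```

If retention time is taken to be inversely proportional to leakage current (a simplifying assumption), the same factor describes how much shorter the refresh period must be at the higher temperature.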

3 ECC for DRAM 

Error correcting codes are traditionally used in communications to battle the 
corruption of transmitted data by channel noise. Extra information is added to 
the original data to enable the reconstruction of the original data transmitted. 
The encoded data, or codewords, are sent through the channel and decoded at 
the receiving end. During decoding the errors are detected and corrected if the 
amount of error is within the allowed, correctable, range. This range depends on 
the extra information, parity bits, added during encoding. 
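
The encode/decode cycle described above can be sketched with the classic Hamming (7,4) single-error-correcting code. This is only a small illustrative stand-in: the simulations in this paper use a (72,64) modified Hamming code, and the function names below are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword.

    Codeword positions 1..7 hold p1, p2, d1, p4, d2, d3, d4, where each
    parity bit covers the positions whose index has the matching bit set.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(word):
    """Correct at most one flipped bit; return (data bits, syndrome).

    A syndrome of 0 means no error was detected; otherwise it is the
    1-based position of the bit that was flipped back.
    """
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        w[syndrome - 1] ^= 1
    return [w[2], w[4], w[5], w[6]], syndrome


codeword = hamming74_encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[4] ^= 1  # one leaky cell flips its stored value
recovered, position = hamming74_decode(corrupted)
```

Here the three parity bits are the "extra information", and a single corrupted bit anywhere in the codeword falls within the correctable range.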

In DRAMs, saving data in memory corresponds to sending it down a noisy
channel. Figure 2 shows the usage of ECC to correct errors in a memory system.
Traditionally, ECC has been used to correct hard errors introduced during
fabrication, thus increasing yield. It has also been used to correct soft errors caused
by α-rays during operation [2,3]. Due to the random distribution of errors in
DRAMs, HV parity codes and Hamming codes were most commonly used.

With improvements in modern process technologies, the number of hard
errors has decreased. The remaining few errors are usually dealt with by
bypassing the row or column containing the hard error and using redundant rows or
columns. Moreover, as the junction area in the device decreases due to scaling,
the occurrence of soft errors has decreased [4]. Hence ECC has seldom been used
for general-purpose DRAM in recent years.








Fig. 2. Data flow in ECC added memory 



4 ECC-Enhanced Multirate Refresh Scheme 

Power consumption in DRAM memories is given by the expression

$$P = P_{Array} + P_{Aux}, \qquad (2)$$

where $P_{Array}$ is the power dissipated to read/write data and retain data, and
$P_{Aux}$ is the power consumption of auxiliary modules such as the internal voltage
generator. $P_{Array}$ is mainly due to the switching activity in the cell capacitors,
bit lines, sense amplifiers and decoders and is hence frequency dependent. On
the other hand, $P_{Aux}$ is less frequency dependent.




Fig. 3. Bit error rate versus $t_{REF}$ in the presence or absence of error correction.



Data-retention power can be decreased by extending $t_{REF}$. ECC technology
can be used to correct the errors caused by not refreshing within the required
time. Figure 3 shows simulated bit error rates (BER), defined as the number
of errors over the total number of cells, for a 16Mb memory, with respect to
$t_{REF}$. The simulation was based on a leakage current distribution reported in
[5]. The three graphs show results when the memory is operated with no ECC,
with a row-based single error-correcting code (denoted by SEC), and a row-based
double error-correcting code (denoted by DEC). Since ECCs reduce the number
of generated errors, a longer $t_{REF}$ is possible at any given error rate. The overall
extent to which $t_{REF}$ can be prolonged depends on the tolerable error levels of
the application for which the memory is used.
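
Why error correction lowers the error rate at a given refresh period can be sketched with a simple binomial model. The sketch assumes that cells fail independently with the same probability, which is an assumption made here for illustration only; the paper's simulation instead uses the measured leakage distribution of [5]. The function name and the probability value are hypothetical.

```python
from math import comb


def codeword_failure_prob(p_cell, n_bits, t_correctable):
    """Probability that more than t_correctable of n_bits cells have failed.

    Assumes each cell has lost its value independently with probability
    p_cell by the time it is refreshed. A code that corrects t errors
    fails only when more than t cells in the codeword are corrupted.
    """
    p_ok = sum(
        comb(n_bits, i) * p_cell ** i * (1.0 - p_cell) ** (n_bits - i)
        for i in range(t_correctable + 1)
    )
    return 1.0 - p_ok


# Hypothetical per-cell failure probability at some extended refresh period.
p = 1e-4
no_ecc = codeword_failure_prob(p, 72, 0)   # any failed cell is an error
with_sec = codeword_failure_prob(p, 72, 1)  # one failed cell is corrected
with_dec = codeword_failure_prob(p, 72, 2)  # two failed cells are corrected
```

Under this model the uncorrectable-error probability drops sharply with each additional correctable bit, which is what allows a longer refresh period at the same tolerated error rate.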

In addition to the modified dissipation from the conventional sources, our
ECC-enhanced approach incurs the dissipation of the ECC circuitry and the
additional parity bits:

$$P' = P'_{Array} + P_{Aux} + P_{ECC} + P_{Parity} \qquad (3)$$

The power consumption due to the ECC, $P_{ECC}$, and to the parity bits, $P_{Parity}$,
are also frequency dependent and will thus offset the decrease in the array power
$P'_{Array}$. The size of the ECC circuitry and associated parity bits depends on the
choice of an ECC.

The introduction of ECC is not guaranteed to extend $t_{REF}$ when a single
refresh period is used. In the case of single error correction, for example, if two
worst-case bits appear in a single codeword, they will still determine the extended
$t_{REF}$ for the entire memory. Since the geometric location of the bad cells cannot
be controlled, the resulting extended $t_{REF}$ cannot be controlled either.




[Figure: memory array with blocks marked as refreshed at $t_{REF0}$, $t_{REF1}$, or
$t_{REF2}$, where $t_{REF0} < t_{REF1} < t_{REF2}$. (a) Conventional refresh;
(b) multirate block refresh.]

Fig. 4. Multirate block refresh scheme.



Using a collection of discrete refresh periods $t_{REF}$ to selectively refresh
memory blocks can increase the average refresh period and reduce power dissipation.
The minimum $t_{REF}$ within the set can be set to the $t_{REF}$ without ECC.
Memory blocks comprising “bad” cells will still be refreshed at this rate. The longer
refresh periods can be used to refresh the blocks with good cells. Once the
refresh periods are selected, the variability is in the number of the memory blocks
refreshed at a particular $t_{REF}$.

Figure 4 shows the application of our multirate scheme on a memory array.
In this figure, our approach is applied at a fine granularity level by segmenting
the refresh block, which conventionally is a row, into smaller blocks. In this case,
the total dissipation is given by the equation

$$P'' = P'_{Array} + P_{Aux} + P'_{ECC} + P'_{Parity} + P_{PA}, \qquad (4)$$

where $P_{PA}$ denotes the dissipation due to the partial activation of a row.
In this approach, the additional energy required for refreshing smaller refresh
blocks, partially activating a row for each different $t_{REF}$, is traded off to increase
the average refresh period. Therefore, total savings depend on the size of the
refresh block and the associated overhead.

The implementation of the proposed ECC-enhanced multirate refresh scheme
requires storing the refresh period $t_{REF}$ of each block in a refresh controller.
The implementation of two refresh periods $t_{REF}$ for a memory without ECC has
been reported in [12]. Additional circuitry is required for partial row activation
if the refresh block is smaller than a row. The implementation of memory arrays
with partial row activation to reduce word line capacitance has been reported
in [13]. The information about the required $t_{REF}$ can be obtained after
manufacturing and can be stored in many forms. For example, it can be hard-wired
using electrical fuses. Alternatively, it can be stored in re-writable memory
elements if post-fabrication modification is desired. During memory operation, the
refresh controller uses the stored information to refresh blocks at their required
$t_{REF}$. If the multiple refresh periods are multiples of the minimum refresh
period ($t_{REF,MIN}$), then refreshing can be achieved by simple consecutive refreshes
at $t_{REF,MIN}$, activating only the refresh blocks that need to be refreshed and
skipping the ones that do not.
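
The skip-based scheduling described in the last sentence can be sketched as a simple counter test. This is an illustrative software model, not the refresh controller of the paper; the function names and the example multipliers are hypothetical.

```python
def blocks_to_refresh(tick, multipliers):
    """Return the indices of the blocks that are due at this refresh tick.

    One tick corresponds to one minimum refresh period. multipliers[i] is
    the refresh period of block i expressed as an integer multiple of the
    minimum refresh period (the per-block value the controller stores).
    """
    return [i for i, m in enumerate(multipliers) if tick % m == 0]


def count_refreshes(multipliers, n_ticks):
    """Total number of block refreshes issued over n_ticks ticks."""
    return sum(len(blocks_to_refresh(t, multipliers)) for t in range(n_ticks))


# Hypothetical array of four blocks: two contain "bad" cells (multiplier 1),
# the other two tolerate 4x and 11x longer refresh periods.
multipliers = [1, 1, 4, 11]
multirate = count_refreshes(multipliers, 44)
conventional = count_refreshes([1, 1, 1, 1], 44)
```

With these example values the multirate scheme issues 103 block refreshes over 44 ticks, against 176 when every block is refreshed at the minimum period.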



5 Algorithm for Selecting Optimal Refresh Periods 



The power consumption of a memory array under multirate refreshing is
proportional to the sum of the power consumption of each block at its refresh period.
Hence, total power consumption is given by the expression

$$P = A \sum_{i=1}^{k} N_i \, \frac{1}{t_{REF_i}} \qquad (5)$$

where $N_i$ is the number of blocks that are refreshed at a refresh period $t_{REF_i}$,
and A is a proportionality factor. It should be noted that power consumption
depends on the size of the refresh block, the number of refresh periods, and the
refresh periods themselves.

Figure 5 demonstrates the basic idea behind the computation of an optimal 
set of refresh periods for a memory array. This graph shows the number of 
blocks that have a given retention time. Each vertical line corresponds to a 
refresh period. Between any two consecutive vertical lines, the total area under 
the curve gives the total number of blocks refreshed at the shorter of the two 
periods. The refresh periods must be chosen so that the sum of the individual 
area/period ratios is minimized. 

Figure 6 gives the pseudocode of our algorithm that computes an optimal set
of refresh periods for a memory array with M rows of N bits, given the required
refresh period for each refresh block of l cells (DB). For simplicity, our procedure
is described for k = 4 refresh periods. The minimum refresh period is set to
the single-period refresh period $t_{REF,MIN}$. The nested loop structure iteratively
assigns possible values to the three remaining refresh periods, computing the
corresponding power of each assignment using Equation 5. For arbitrary k, there
are k - 1 nested loops and the complexity of this scheme is $O(n^{k-1})$, where n is
the number of refresh blocks in the memory array.

Fig. 5. Optimal selection of multiple refresh periods

    STREF_OPT(DB, k = 4)
      STREF = {}
      temp = infinity
      for p = tREF_MIN to tREF_MAX do
        N[p] = A_p        > number of refresh blocks refreshed at tREF_p
        for q = tREF_p to tREF_MAX do
          N[q] = A_q      > number of refresh blocks refreshed at tREF_q
          for r = tREF_q to tREF_MAX do
            N[r] = A_r    > number of refresh blocks refreshed at tREF_r
            N[MIN] = N_tot - (N[p] + N[q] + N[r])
                          > number of refresh blocks refreshed at tREF_MIN
            P = N[MIN]/tREF_MIN + N[p]/tREF_p + N[q]/tREF_q + N[r]/tREF_r
            if P < temp then
              temp = P
              STREF = {tREF_MIN, tREF_p, tREF_q, tREF_r}
            end if
          end for
        end for
      end for
      return STREF

Fig. 6. Algorithm for finding optimal set STREF of block refresh periods.
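
The exhaustive search of Figure 6 can be sketched in runnable form as follows. This is an illustrative reading of the procedure, not the authors' implementation: it assumes that the input is the required refresh period of each block, that candidate periods are the distinct required periods, and that each block is refreshed at the longest chosen period not exceeding its own requirement. The function names are hypothetical, and the proportionality factor of Equation 5 is set to 1. Unlike the complexity stated in the text, this sketch re-evaluates the power from scratch for every candidate set.

```python
from bisect import bisect_right
from itertools import combinations


def refresh_power(required, periods):
    """Relative power per Equation 5: sum over blocks of 1 / assigned period."""
    periods = sorted(periods)
    total = 0.0
    for t_req in required:
        # Longest chosen period that does not exceed the block's requirement.
        idx = bisect_right(periods, t_req) - 1
        total += 1.0 / periods[idx]
    return total


def optimal_refresh_periods(required, k):
    """Try every set of up to k - 1 periods in addition to the minimum one."""
    t_min = min(required)
    candidates = sorted(set(required) - {t_min})
    best_set = (t_min,)
    best_power = refresh_power(required, best_set)
    n_extra = min(k - 1, len(candidates))
    for extra in combinations(candidates, n_extra):
        periods = (t_min,) + extra
        power = refresh_power(required, periods)
        if power < best_power:  # keep the set with the lowest power
            best_power, best_set = power, periods
    return best_set, best_power


# Hypothetical required refresh periods (in ms) for eight refresh blocks.
required = [64, 64, 128, 256, 256, 256, 512, 512]
two_periods, two_power = optimal_refresh_periods(required, 2)
```

For this hypothetical input, two refresh periods of 64 ms and 256 ms give the lowest power, roughly half that of refreshing every block at 64 ms.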

Fig. 7. Distribution of required $t_{REF}$ for different size refresh blocks.



6 Simulation Results 

We evaluated the effectiveness of our ECC-enhanced multirate refresh scheme
using a (72,64) modified Hamming SEC [9] and a 16Mb DRAM whose $t_{REF}$
distribution and electrical characteristics are reported in [5] and [10], respectively.

Figure 7 shows the impact of refresh block granularity and ECC on the
number of refresh blocks with short $t_{REF}$. The two graphs on top give the
number of blocks at each minimum refresh period for cell-based and row-based
refresh, respectively, with no error correction. Row-based SEC greatly reduces
the number of rows that require short $t_{REF}$. The refresh periods can be extended
even further by reducing the size of a refresh block from a 4608-bit row (4096
data bits + 512 ECC bits) to a 72-bit codeword (64 data bits + 8 ECC bits).

Figure 8 shows the trend of power consumption with the introduction of a
second refresh period $t_{REF}$. As the second $t_{REF}$ increases toward the maximum
refresh period shown in the distribution of Figure 7, power consumption
decreases below that of the single-refresh scheme at $t_{REF,MIN}$. Moreover, power
dissipation decreases with the application of ECC and with finer refresh
granularity, since the fraction of blocks requiring short $t_{REF}$ decreases.

Figure 9 shows the positive effect of multiple refresh periods on power
dissipation. The dissipation of row-refresh with SEC is close to the ideal minimum of
cell-refresh. The use of ECC results in significant power reductions with fewer
periods than without ECC. When two refresh periods are used, setting the second
period at a multiple of the original $t_{REF}$ of 64 ms [10] reduces the
complexity of refresh control. Since variations of power dissipation are more gradual at
short periods, selecting a refresh period of 64 × 11 = 704 ms, which is slightly
smaller than the optimal 735 ms, will increase the average refresh period (and
thus decrease dissipation) by approximately 11 times.

Fig. 8. Power consumption versus period of second refresh cycle.

Fig. 9. Effect of block size and number of refresh periods on power.
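
The choice of 704 ms follows from rounding the optimal period down to a multiple of the base period, as the short sketch below shows. The function name is hypothetical; the 64 ms and 735 ms values are those quoted in the text.

```python
def largest_multiple_not_exceeding(base_ms, optimum_ms):
    """Return (multiplier, period) for the largest multiple of base_ms
    that does not exceed optimum_ms."""
    multiplier = optimum_ms // base_ms
    return multiplier, multiplier * base_ms


multiplier, period_ms = largest_multiple_not_exceeding(64, 735)
# multiplier is 11 and period_ms is 704; the next multiple, 768 ms,
# would exceed the optimal 735 ms.
```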



7 Conclusion 

This paper describes an ECC-enhanced multirate refresh scheme for low data- 
retention power in dynamic memories and presents an algorithm for selecting an 
optimal set of refresh periods. Simulation results with a 16Mb DRAM show that 
simple Hamming SEC can extend the average refresh period by up to 11 times 
over conventional single-cycle refresh. We are currently evaluating the energy 
efficiency of our scheme including the control and ECC overhead. We are also 
investigating efficient algorithms for computing optimal refresh periods. 

Abstract. This paper describes the design of VHDL-based I.P. (Intellectual
Property) cores using a Double-Latch clocking scheme instead of a single-phase
clock and D-Flip-Flops. This Double-Latch clocking scheme with two
non-overlapping clocks provides several advantages in deep submicron
technologies, i.e. a much larger clock skew tolerance, clock trees that are easy to
generate, efficient clock gating and, in some examples such as an 8-bit
CoolRISC microcontroller, reduced power consumption.



1 Introduction 

More and more I.P. (Intellectual Property) cores are available on the market.
Increasingly, they are “soft” cores written in VHDL or Verilog and synthesizable
using Synopsys. One can find 32-bit RISC cores, DSP cores and 8-bit microcontroller
cores, for instance many 8051 cores. The main issue for such cores implemented in
deep submicron technologies is reliability. As they are synthesized using
Synopsys, a soft core has to work for any customer with any constraint. It is
therefore more difficult to guarantee that there is no timing violation in a
synthesized “soft” core than in a hard core (layout provided to the customer).
Furthermore, enhanced reliability generally increases the power consumption.
Increasing reliability while decreasing power consumption is therefore a major
issue.



2 I.P. Cores 

As mentioned in the introduction, the main issue in the design of “soft” cores [1] is
reliability. In deep submicron technologies, gate delays are smaller and smaller
compared to wire delays. Complex clock trees have to be designed with clock tree
generation tools linked with routers to satisfy the required timing after the place
and route step, mainly the smallest possible clock skew, and to avoid any timing
violation.

Furthermore, “soft” cores have to offer low power consumption to be attractive
to possible licensees. If the clock tree requires strong buffering to achieve the
required clock skew, its power consumption could be larger than
desired. The clocking scheme of I.P. cores is therefore a major issue, both for its
functionality and for its power consumption.

Today, most I.P. cores are based on a single-phase clock and D-Flip-Flops
(DFF). An alternative to this conventional approach
is presented in this paper. It is based on a Double-Latch approach with two
non-overlapping clocks. This clocking scheme has been used for the 8-bit CoolRISC
microcontroller I.P. core [2] as well as for other cores, such as a DSP core and other
execution units [3]. The advantages as well as the disadvantages will be presented.



3 CoolRISC Microcontroller 

The CoolRISC is a 3-stage pipelined core (Fig. 1). The branch instruction is executed
in only one clock [2], [4], [5]. In that way, no load or branch delay can occur in the
CoolRISC core, resulting in a strict CPI = 1 (Clocks Per Instruction). This is not the
case for other 8-bit pipelined microprocessors (PIC, AVR, Scenix, MCS-251, Flip8051).
It is known that the reduction of CPI is the key to high performance.



[Figure: 3-stage pipeline with no load delay and no branch delay. The stages
(fetch & branch, execute, store result) each take 1 clock cycle; the diagram
shows a branch instruction and an arithmetic instruction.]

Fig. 1. CoolRISC Pipeline



For each instruction, the first half clock is used to precharge the ROM program
memory. The instruction is read and decoded in the second half of the first clock (Fig.
1). A branch instruction is also executed during the second half of this first clock,
which is long enough to perform all the necessary transfers. For a load/store
instruction, only the first half of the second clock is used to store data in the RAM
memory. For an arithmetic instruction, the first half of the second clock is used to
read an operand in the RAM memory or in the register set, the second half of this
second clock to perform the arithmetic operation and the first half of the third clock to
store the result in the register set.

Another very important issue in the design of 8-bit microcontrollers is power
consumption. The gated clock technique [2], [4], [5] has been extensively used in the
design of the CoolRISC cores (Fig. 2).



[Figure: ALU with data registers and a control register at its inputs. Annotations:
To minimize the activity of a combinational circuit (ALU), registers are located
at the inputs of the ALU. They are loaded at the same time, so there are very few
transitions in the ALU. These registers are at the same time pipeline registers
(a pipeline for free!). The pipeline mechanism does not result in a more complex
architecture, but reduces the power.]

Fig. 2. Gated Clock ALU

The ALU, for instance, has been designed with input and control registers that are
loaded only when an ALU operation has to be executed. During the execution of
another instruction (branch, load/store), these registers are not clocked, so no
transitions occur in the ALU (Fig. 2). This reduces the power consumption. A similar
mechanism is used for the instruction registers; thus in a branch, which is executed
only in the first pipeline stage, no transitions occur in the second and third stages of
the pipeline. It is interesting to see that gated clocks can be advantageously combined
with the pipeline architecture; the input and control registers implemented to obtain a
gated-clock ALU are naturally used as pipeline registers.



4 Latch-Based Design of I.P. Cores 

Figure 3 shows the double-latch concept that has been chosen for such I.P. cores to be
more robust to clock skew, flip-flop failures and timing problems at very low
voltage [6]. The clock skew between the various ø1 (respectively ø2) pulses has to be
shorter than half a period of CK. However, two clock cycles of the
master clock CK are required to execute a single instruction. This is why, for instance
in the TSMC 0.25 µm technology, 120 MHz is needed to generate 60 MIPS (CoolRISC
with CPI = 1), but the two øi clocks and clock trees run at 60 MHz. Only a very small
logic block is clocked at 120 MHz to generate the two 60 MHz clocks.
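
One way to picture how two non-overlapping 60 MHz clocks can be derived from a 120 MHz master clock is the sketch below. It models one plausible scheme, in which each phase is high during the high half of alternate master clock cycles; this is an assumption made for illustration, since the actual waveforms are those of the paper's Figure 3, and the function name is hypothetical.

```python
def two_phase_clocks(n_master_cycles):
    """Sample phi1 and phi2 twice per master clock cycle (high half, low half).

    phi1 is high during the high half of even master cycles and phi2 during
    the high half of odd master cycles, so each phase repeats every two
    master cycles and the two phases are never high together.
    """
    phi1, phi2 = [], []
    for cycle in range(n_master_cycles):
        for ck_high in (True, False):
            phi1.append(1 if ck_high and cycle % 2 == 0 else 0)
            phi2.append(1 if ck_high and cycle % 2 == 1 else 0)
    return phi1, phi2


phi1, phi2 = two_phase_clocks(4)
master_hz = 120e6
phase_hz = master_hz / 2  # each phase repeats every two master cycles
```

In this model the gap between the falling edge of one phase and the rising edge of the other is half a master clock period, which is the non-overlap margin that absorbs clock skew.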

Fig. 3. Double-Latch Clocking Schemes



The design methodology using latches and two non-overlapping clocks has many 
advantages over the use of DFF methodology. Due to the non overlapping of the 
clocks and the additional time barrier caused by having two latches in a loop instead 
of one DFF, latch based designs support greater clock skew before failing than a 
similar DFF design (each targeting the same MIPS). 

With latch-based designs, the clock skew becomes relevant only when its value is 
close to the non-overlapping of the clocks (so half the period of the master clock). 
When working at lower frequency and thus increasing the non-overlapping of clocks, 
the clock skew is never a problem. It can even be safely ignored when designing 
circuits at low frequency. However, a shift register made with DFF can have clock 
skew problems at any frequency. 

This allows the synthesizer and router to use smaller clock buffers and to simplify 
the clock tree generation, which will reduce the power consumption of the clock tree. 

Example: A DSP core was synthesized with a low-power library in TSMC 0.25 µm. Test
bench A contains only a few multiplication operations, while test bench B
performs a large number of MAC operations. The circuit was synthesized and then
routed. Table 1 shows the power consumption results for two different values of the
clock skew constraint given to CTGen: a maximum clock skew of 3 ns for the first run
and of 10 ns for the second. The results show that, while the power is
sensitive to the application program, it is also quite sensitive to the required skew:
a power reduction of about 50% from 3 ns to 10 ns skew. This shows that major power
savings can be obtained with latch-based circuits when the clock frequency allows
the clock skew constraints to be relaxed.



Table 1. Power consumption of the same core with various test benches and skew

Skew     Test bench A    Test bench B
10 ns    0.44 mW/MHz     0.76 mW/MHz
3 ns     0.82 mW/MHz     1.15 mW/MHz
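
The relative savings implied by Table 1 can be checked with a few lines of arithmetic. The sketch assumes that the lower figures (0.44 and 0.76 mW/MHz) belong to the relaxed 10 ns skew constraint and the higher ones to the 3 ns constraint, which is how the table reads but is stated here as an assumption; the variable and function names are hypothetical.

```python
# Power figures from Table 1 in mW/MHz, keyed by skew constraint and test bench.
power_mw_per_mhz = {
    "10 ns": {"A": 0.44, "B": 0.76},
    "3 ns": {"A": 0.82, "B": 1.15},
}


def reduction_percent(tight, relaxed):
    """Percentage saved when moving from the tight to the relaxed constraint."""
    return 100.0 * (tight - relaxed) / tight


savings = {
    bench: reduction_percent(power_mw_per_mhz["3 ns"][bench],
                             power_mw_per_mhz["10 ns"][bench])
    for bench in ("A", "B")
}
```

With these figures the reduction is about 46% for test bench A and about 34% for test bench B, so the 50% quoted in the text is best read as a rounded upper figure.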



Furthermore, if the chip has clock skew problems at the targeted frequency after
integration, a latch-based design makes it possible to reduce the clock frequency.
The clock skew problem then disappears, allowing the designer to
test the chip functionality and possibly to detect other bugs or to validate the design
functionality. This can reduce the number of test integrations needed to validate the
chip. With a DFF design, when a clock skew problem appears, one has to reroute and
integrate again. This point is very important for the design of a chip in a new process
that is incompletely or badly characterized by the foundry, which is generally the
case, as a new process and new chips in this process are designed concurrently to
reduce the time to market.

Using latches for pipeline structure can also reduce power consumption when 
using such a scheme in conjunction with clock gating. The latch design has additional 
time barriers, which stop the transitions and avoid unneeded propagation of signal and 
thus reduce glitch power consumption. The clock gating of each stage (latch register) 
of the pipeline with individual enable signals, can also reduce the number of 
transitions in the design compared to the equivalent DFF design, where each DFF is 
equal to two latches clocked and gated together. 

Another advantage of a latch design is time borrowing (Fig. 4). It allows a
natural distribution of computation time across pipeline structures. With DFFs,
each logic stage of the pipeline should ideally use the same computation time,
which is difficult to achieve, and in the end the design is limited by the slowest
stage (plus a margin for the clock skew). With latches, the slowest pipeline stage
can borrow time from the previous pipeline stage, the next one, or both. The clock
skew only reduces the time that can be borrowed. An interesting paper [7] has
presented time borrowing with DFFs, but such a scheme needs a completely new
automatic clock tree generator that does not minimize the clock skew but uses it to
borrow time between pipeline stages.
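The effect of time borrowing can be illustrated with a small timing model. The sketch below is not taken from the paper: it assumes an idealized chain of transparent latches in which latch i opens at i times the half period and stays open for one half period, and it ignores setup times, latch delays and the non-overlap gap. The function names and the example delays are hypothetical.

```python
def edge_triggered_feasible(delays, half_period, skew=0.0):
    """Hard-edge reference: every stage must fit in its own time slot."""
    return all(d <= half_period - skew for d in delays)


def latch_pipeline_feasible(delays, half_period, skew=0.0):
    """Idealized transparent-latch chain with time borrowing.

    Latch i is assumed to open at i * half_period and to close one half
    period later. Data leaves a latch as soon as both the data and the
    latch opening are available, so a slow stage may finish while the
    next latch is already open (it borrows time from the next stage).
    """
    arrival = 0.0  # data is ready when the first latch opens at t = 0
    for i, delay in enumerate(delays):
        opens = i * half_period
        depart = max(arrival, opens)          # wait for the latch to open
        arrival = depart + delay              # arrival at latch i + 1
        closes_next = (i + 2) * half_period   # closing edge of latch i + 1
        if arrival > closes_next - skew:      # skew shrinks the borrowable time
            return False
    return True


if __name__ == "__main__":
    stages = [4, 6, 4, 6]  # hypothetical stage delays, arbitrary time units
    print("hard edges:", edge_triggered_feasible(stages, 5))
    print("latches:   ", latch_pipeline_feasible(stages, 5))
```

In this model the unbalanced stages fail the hard-edge check but pass the latch check, because each slow stage uses slack left by its faster neighbor; a large enough skew eventually removes that margin.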



[Figure: timing diagram of Clock and Clock’ with the computation time of the pipeline stages]

Fig. 4. Time Borrowing



Using latches can also reduce the number of MOS transistors in a design. For
example, a microcontroller has 16 registers of 32 bits, i.e. 512 DFFs or 13’312 MOS
(using DFFs with 26 MOS). With latches, the master part of the registers can be
common to all the registers, which gives 544 latches or 6’528 MOS (using latches
with 12 MOS). In this example, the register area is reduced by a factor of 2.
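The figures in this example can be rechecked with a few lines of arithmetic. The sketch below assumes that the 544 latches are made of 512 slave latches (one per stored bit) plus 32 shared master latches (one per bit position); this reading matches the numbers given in the text but is not spelled out there.

```python
REGISTERS = 16
BITS_PER_REGISTER = 32
MOS_PER_DFF = 26
MOS_PER_LATCH = 12


def dff_register_file():
    """Return (cell count, transistor count) for the flip-flop version."""
    cells = REGISTERS * BITS_PER_REGISTER
    return cells, cells * MOS_PER_DFF


def latch_register_file():
    """Return (cell count, transistor count) for the latch version.

    Assumption: one slave latch per stored bit plus one master latch
    per bit position, shared by all registers.
    """
    cells = REGISTERS * BITS_PER_REGISTER + BITS_PER_REGISTER
    return cells, cells * MOS_PER_LATCH


if __name__ == "__main__":
    dff_cells, dff_mos = dff_register_file()
    latch_cells, latch_mos = latch_register_file()
    print(dff_cells, dff_mos, latch_cells, latch_mos, dff_mos / latch_mos)
```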
Fig. 5. Latch-based Clock Gating



5 Gated Clock with Latch-Based Designs 

The latch-based design also allows a very natural and safe clock gating methodology.
Figure 5 shows a simple and safe way of generating enable signals for clock gating.
This method gives glitch-free clock signals without adding memory elements, which
are needed with DFF clock gating.

Synopsys handles the proposed latch-based design methodology very nicely. It
performs the time borrowing well and seems to analyze the clocks correctly for
speed optimization. So it is possible to use this design methodology with Synopsys,
although there are a few points of discussion linked with the clock gating.

This clock gating methodology cannot be inserted automatically by Synopsys. The
designer has to write the description of the clock gating in the VHDL code. This
statement can be generalized to all designs using the above latch-based design
methodology. We believe Synopsys can do automatic clock gating for a pure double
latch design (in which there is no combinatorial logic between the master and slave
latch), but such a design results in a loss of speed compared to a similar DFF design.

The most critical problem is to prevent the synthesizer from optimizing the clock
gating AND gate together with the rest of the combinatorial logic. To ensure a
glitch-free clock, this AND gate has to be placed as shown in Figure 5. The designer
can easily do this manually by placing these AND gates in a separate level of
hierarchy of the design or by placing a 'don’t touch' attribute on them.

Forcing a 'don’t touch' on these gates has the drawback that this part of the
clock tree will not be optimized for speed or clock buffering. Note that the AND
gate shown in Figure 5 represents a NAND gate followed by an inverting clock
buffer. It would be useful if the tool handled this gate in a special way to keep it
in front of the latch clock input, for instance through a specific attribute that
lets the tool recognize it as a clock gating gate. Such an attribute would forbid
the optimizer to move logic between the gate and the latch, but would still allow
it to size the NAND and the clock buffer.

The second problem we encountered was that Design Compiler found timing loops
going through the clock enables. Assume two registers A and B, each register
having its clock gated by an enable signal (Fig. 6). The enable signal of
register A depends on the value of register B and the enable of register B depends
on the value of register A. This is seen as an open loop by the tool, although the
clocks of registers A and B are defined in such a way that they cannot be '1' at
the same time. The condition on the clocks ensures that there is no open loop.
Design Compiler seems not to take the non-overlapping of the clocks into account
when analyzing this loop, and we found no way to declare it in such a way that it
is taken into account. This loop has to be cut with the 'set_disable_timing'
command. This workaround is not good, because it disables the timing optimization
on some paths of the design that should have been optimized. In the above example,
there is an important timing path from the clock input of latch A to the enable
input of the AND gate of the clock gating of latch A. There is a similar path for
latch B, and those two paths overlap. If you place a 'set_disable_timing'
somewhere in the loop, you cut at least one of those paths.
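The argument that non-overlapping clocks keep the loop closed can be checked on sampled waveforms. The sketch below is an illustration only, not a model of what Design Compiler does: it assumes two periodic clocks with the same period and high time, shifted by an offset, and lists the sample instants at which both latches would be transparent at once. All names and numbers are hypothetical.

```python
def is_high(t, period, high_time, offset=0):
    """True when a periodic clock, shifted by `offset`, is high at time t."""
    return (t - offset) % period < high_time


def loop_open_instants(period, high_time, offset_b):
    """Sample instants within one period at which both clocks are high.

    The loop through latches A and B is only open when both latches
    are transparent at the same time, i.e. when both clocks are high.
    """
    return [
        t
        for t in range(period)
        if is_high(t, period, high_time)
        and is_high(t, period, high_time, offset_b)
    ]


if __name__ == "__main__":
    # Non-overlapping phases: the list of open instants stays empty.
    print(loop_open_instants(period=10, high_time=4, offset_b=5))
    # Overlapping phases: the loop is open for some instants.
    print(loop_open_instants(period=10, high_time=4, offset_b=2))
```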




[Figure: registers A and B with gated clocks ClockA and ClockB]



Fig. 6. Timing Loops 



6 Results 

A CoolRISC-DL 816 core with 16 registers, synthesizable by Synopsys, has been
designed according to the proposed Double Latch (DL) scheme (clocks ø1 and ø2).
Its performance, as estimated by Synopsys (core only, about 20’000 transistors)
in TSMC 0.25 µm, is as follows:

- 2.5 Volt, about 60 MIPS (but 120 MHz single clock). This holds for the core
only. If a program memory with 2 ns of access time is chosen, the achieved
performance is reduced to 50 MIPS, because the access time is included in the
first pipeline stage.

- 1.05 Volt, about 10 µW/MIPS, about 100’000 MIPS/watt

The core “DFF+Scan” is a previous CoolRISC core designed with flip-flops [2, 4, 5].
The CoolRISC-DL “double latch” cores with or without special scan logic provide
better performance.
Fig. 7. Power consumption comparison of “soft” CoolRISC cores



7 Conclusion 

The I.P. CoolRISC core has been licensed to one company. Furthermore, the Double- 
Latch clocking scheme has been used for other cores and execution units, such as in 
[3]. It was shown that it was more reliable and mandatory at very low voltage. 
Furthermore, it provides a power consumption reduction compared to a single-phase 
clock scheme with D-Flip-Flops. 
Abstract. This paper describes the architecture, functionality, and design
of NX-2700, a digital television (DTV) and media-processor chip from Philips
Semiconductors. NX-2700 is the second generation of an architectural family
of programmable multimedia processors that supports all eighteen United States
Advanced Television Systems Committee (ATSC) [1] formats and is targeted at
the high-end DTV market. NX-2700 is a programmable processor with a very
powerful, general-purpose Very Long Instruction Word (VLIW) Central Processing
Unit (CPU) core that implements many non-trivial multimedia algorithms,
coordinates all on-chip activities, and runs a small real-time operating
system. The CPU core, aided by an array of autonomous multimedia co-processors
and input-output units with Direct Memory Access (DMA) capability, facilitates
concurrent processing of audio, video, graphics, and communication-data.



1 Architecture and Functionality of NX-2700 

NX-2700 is a DTV processor chip targeted to be used in high or standard- 
definition television systems, digital set-top-boxes, and other DTV-based appli- 
cations. A combination of hardware and software is used to implement the key 
DTV functionality. The chip features a very powerful general-purpose VLIW 
processor core (DSPCPU) and an array of DMA-driven multimedia and in- 
put/output functional units and co-processors that operate independently and in 
parallel with the DSPCPU, thereby making software media-processing of mul- 
timedia algorithms extremely efficient. As illustrated in the block-diagram in 
Figure 1, some key functional modules of the NX-2700 design are: 

— high-speed internal data-highway buses used for memory-data trans- 
fers as well as Memory Mapped Input Output (MMIO) control register 
read/write transactions, 

— a Main Memory Interface (MMI) unit that arbitrates accesses to the 
highway buses and manages the interface between the NX-2700 core plus its 
on-chip peripherals and the off-chip main memory (SDRAM), 

— a VLIW CPU core that uses a general-purpose VLIW Instruction Set 
Architecture (ISA) enhanced by powerful multimedia-specific instructions. 
Fig. 1. Block diagram of NX-2700



— a Transport-stream Processor (TP) that can gluelessly connect to as- 
sorted demodulator/decoder chips and perform PID-based filtering of 
MPEG-2 transport packets as per the ISO/IEC 13818-1 standard, 

— a slice-level MPEG-2 decoder that can decode the highest-resolution
(main profile at high level) interlaced compressed video bitstream,¹

— multiple Audio In (AI) and Audio Out (AO) processors that can
capture audio data from the external world, can produce up to 8 channels of
audio output, can decode AC-3 and ProLogic audio, and can also connect
to external audio amplifiers,

— a Sony-Philips Digital Interface (SPDIF) that not only supports one
or more Dolby-Digital AC-3 6-channel data streams and/or MPEG-1 and
MPEG-2 audio streams as per Project 1937, but also produces
IEC958-compliant outputs,



¹ The MPEG pipeline consists of a Variable-Length Decoder (VLD), a Run-Length
Decoder (RLD), an Inverse Scan (IS) unit, an Inverse Quantizer (IQ), an Inverse 
Discrete Cosine Transform (IDCT) block, and a Motion Compensation (MC) unit. 
— a micro-programmable High-Definition Video Out (HDVO) unit
that can mix multiple video and graphics planes and is capable of vertically
and horizontally scaling pictures of the highest resolution (1920 x 1080)
specified in the ATSC DTV standard,²

— a DVD Descrambler (DVDD) that supports both PC-based and stand- 
alone DVD players, 

— a standard-definition-video-in (VI) subsystem that can capture a
video stream directly from any CCIR656/601-compliant device,

— a standard-definition-video-out (VO) subsystem that can produce
outputs in a PAL or NTSC format for driving monitors and a
CCIR656-compliant format for recording in digital VCRs,

— a two-wire Inter-Integrated Circuit (IIC) interface for configuring
and inspecting the status of various peripheral video devices such as digital
multi-standard decoders, digital encoders, and digital cameras,

— a Synchronous Serial Interface (SSI) that is specially designed to con- 
nect to an off-chip modem-analog-front-end subsystem, a network termina- 
tor, an A/D, a D/A, or a Codec through a flexible bit-serial connection and 
perform full-duplex serialization/deserialization of a bit stream from any of 
these devices, 

— a Peripheral Component Interconnect (PCI) interface that allows
easy communication with high-speed peripherals,

— a PCI External Input-Output (PCI-XIO) interface that serves as
a bridge between the PCI bus and XIO devices such as ROMs and flash
EEPROMs, thereby allowing a PCI-like transaction to proceed between
NX-2700 and an inherently-non-PCI device on the PCI bus,

— a system-boot-logic block that enables configuration of the various in- 
ternal registers via host-assisted or autonomous bootstrapping, 

— a JTAG controller that facilitates board-level testing by providing a bridge 
for asynchronous (to the NX-2700 system clock) data-transfer between the 
on-chip scannable registers and the external Test Access Port (TAP), and 

— a clock module comprising Phase Locked Loop (PLL) filter circuits and 
Direct Digital Synthesizer (DDS) circuits for generating assorted clocks for 
the memory, the core, and the peripherals. 



² The HDVO unit contains a set of pipelined filters and video processing units that
communicate with a set of memory blocks via a Crossbar interconnection network
and perform functions such as horizontal scaling (polyphase direct and
transposed filtering), horizontal filtering (multi-tap FIR filtering), panoramic
zooming (horizontal scaling using a continuously-varying zoom factor), vertical
filtering and scaling (de-interlacing and median filtering), 129-level alpha
blending (to merge video and graphics planes), chroma keying (for
computer-generated or modified overlays), table lookup (for color scaling and
color modification, e.g., RGB1/2/4/8/16 to RGB32 conversion), color conversion
(for YUV to RGB and vice versa), and horizontal chroma up/down-sampling.







S. Dutta 



2 VLSI Implementation Highlights 

Some characteristic features of the NX-2700 design that deserve special mention
are as follows:

— Multiple clock domains: NX-2700 being a multi-clock design, specially- 
designed synchronizers, allowing both fast-to-slow and slow-to-fast clock- 
domain transitions, are used at almost all clock-domain crossings, except 
where the data and/or control are guaranteed to be stable by virtue of the 
design. 

— Clock routing: Clock signals are routed all over the chip using a hierarchi- 
cal clock-tree network where specially designed buffers, that equalize clock 
skews, feed the clocks to the storage elements (flip-flops, memory, etc.). 

— Power management: Two different power-management schemes are fol- 
lowed in our design: dynamic clock gating and software-controlled static pow- 
erdown. 

— Silicon-debug aids: In order to aid in the debugging of the final silicon, we
have implemented, in our chip, a SPY mechanism that allows some important
internal signals (the SPY signals) from each block to be observable at
the top level at run-time.

— GPIO functionality: We have designed special on-chip circuitry to en- 
able a large number of pins to operate as General Purpose Software Input 
Output (GPIO) pins and support functions such as infrared remote input, 
printer output, software-controllable switches in the system logic, software 
communication link, etc. 

— HDVO memory-system design: The large number of HDVO memories 
have been organized into individual rows of multiple banks with two wide 
Metal-4 (M4) wires for vdd and ground power distributions across the banks 
in each row. The rest of the memory (in each row) has been covered by a 
grounded M4 plate in order to minimize crosstalk by isolating and shielding 
the memory circuits from the signals routed in the next-higher Metal-5 layer; 
the grounded metal plate acts as a large decoupling capacitor. 

— Package considerations: The chip uses a Prolinx 352-pin Enhanced VBGA
package that features two VDD rings (3.3V & 2.5V) and one GND ring. Since the
thermal resistance (θja) of the package is 10 to 12°C/W, a power dissipation
of 6W at room temperature (25°C) can potentially raise the junction
temperature to (25 + 6 x 12) = 97°C; therefore, to ensure correct operation
at elevated temperatures, the timing and clock-speed analyses have been
performed based on a worst-case operating temperature of 125°C.
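The junction-temperature estimate in the package item above follows the usual first-order relation, junction temperature = ambient temperature + power x junction-to-ambient thermal resistance. A minimal sketch of that calculation, using only the figures quoted in the text (the function name is hypothetical):

```python
def junction_temperature(ambient_c, power_w, theta_ja_c_per_w):
    """First-order estimate: T_j = T_ambient + P * theta_ja."""
    return ambient_c + power_w * theta_ja_c_per_w


if __name__ == "__main__":
    # Range of thermal resistance quoted for the package: 10 to 12 degC/W.
    for theta_ja in (10, 12):
        t_j = junction_temperature(ambient_c=25, power_w=6,
                                   theta_ja_c_per_w=theta_ja)
        print(f"theta_ja = {theta_ja} degC/W -> T_j = {t_j} degC")
```

Both ends of the quoted range stay below the 125°C corner used for the timing analysis.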

3 Design Tools 

We have used state-of-the-art Computer-Aided-Design (CAD) tools for the bulk
of the design process. From the suite of external design-automation tools that
we have used, the most notable ones are:
— Verilog (Cadence and OVI): for Register Transfer Level (RTL) designs,

— Verilog-XL and NC-Verilog (Cadence): for Verilog simulations,

— Design Framework II (Cadence): for design database and schematic entry,

— Design Compiler (Synopsys): for logic synthesis,

— PathMill, TimeMill, PowerMill (Synopsys/Epic): for transistor-level
timing and power analysis,

— Pearl (Cadence): for full-chip static timing analysis,

— Fire & Ice (Simplex Solutions): for extraction of layout parasitics,

— Chrysalis (Chrysalis Symbolic Design, Inc.): for formal verification,

— VeriSure (TransEDA Ltd.): for determining code and branch coverage,

— Cells, Silicon Ensemble, Dracula (Cadence): for place-and-route and
LVS/DRC (Layout Versus Schematic and Design Rule Check) tasks,

— HSPICE (Meta Software): for transistor-level circuit simulation, and

— Quickturn (Quickturn Design Systems, Inc.): for emulation.



4 Design Verification 

Some key aspects of our verification methodology have been: 

— using a combination of C, Verilog, C-shell scripts, and PERL routines to 
develop a hybrid testbed, 

— writing self-checking test programs in assembly or C (or a combination 
thereof) that are compiled and loaded in the external SDRAM using in- 
house software tools, 

— execution of the loaded binary on the Verilog model of the chip in order to 
program the block(s) under test in the desired mode via MMIO reads/writes, 

— development and use of integrated Verilog-based checkers for capturing and 
comparing the run-time outputs from the blocks against expected outputs, 

— automation of the regression runs, 

— using the MPEG decoder from the MPEG Software Simulation Group at 
Berkeley to provide expected results for various public-domain MPEG-2 con- 
formance streams and locally-generated synthetic stress streams, 

— development of a co-simulation (based on Verilog and C) environment for
testing the HDVO sub-blocks, 

— development of a transaction-generator-based random-testing environment 
for block-level testing of the MMI, 

— development of a random-transaction generator for PCI verification, and

— development of application tests for Quickturn-based emulation. 



5 Design Summary 

Table 1 presents some of the physical and electrical characteristics of the NX- 
2700 chip. 
Table 1. Chip-level design parameters

Parameter                    Value
Technology                   0.25 µm CMOS
Metal layers used            5
Core supply voltage          2.5 volts
IO supply voltage            3.3 volts
System clock speed           130 MHz
Average power dissipation    8 watts
Design complexity            18 million devices
Package                      Prolinx Enhanced VBGA
Package pins                 352



6 DTV System Setup 

An example reference design platform, based on NX-2700, is shown in Figure 2. 
The Network Interface Module (NIM) incorporates the VSB demodulator and 
Forward Error Correction (FEC) chips and performs all of the necessary demod- 
ulation and channel-decoding functions from tuning to Transport Stream (TS) 
generation. Once the TS is generated, it is processed by NX-2700, optionally, 
along with a separately-received and decoded standard-definition video and its 
corresponding audio. In a typical digital video application, NX-2700 performs 
the following key functions: 




Fig. 2. NX-2700-based DTV receiver system 
— transport stream capture, demultiplexing & PID filtering,

— bitstream buffer management, 

— MPEG-2 video decoding, 

— AC-3 audio decoding, 

— clock recovery from the bitstream and video-audio synchronization, 

— 2-D graphics for closed captioning, user interface, program guide, etc., 

— display-video format conversion including horizontal and vertical scaling, 
conversions between interlaced and non-interlaced formats, and blending of 
graphics and video surfaces, and 

— processing of the CCIR656 video and its corresponding audio. 
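The PID filtering step in the list above can be illustrated in software. The sketch below is not a description of the NX-2700 transport-stream hardware; it only shows the packet layout defined by ISO/IEC 13818-1, where each transport packet is 188 bytes long, starts with the sync byte 0x47, and carries a 13-bit PID in the low 5 bits of byte 1 and all of byte 2. The function names and the `make_packet` helper are hypothetical.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47


def packet_pid(packet: bytes) -> int:
    """Extract the 13-bit PID from one MPEG-2 transport packet."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid transport packet")
    return ((packet[1] & 0x1F) << 8) | packet[2]


def filter_by_pid(stream: bytes, wanted_pids):
    """Yield the packets of an aligned stream whose PID is wanted."""
    wanted = set(wanted_pids)
    for offset in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = stream[offset:offset + TS_PACKET_SIZE]
        if packet[0] == SYNC_BYTE and packet_pid(packet) in wanted:
            yield packet


def make_packet(pid: int, fill: int = 0xFF) -> bytes:
    """Hypothetical helper: build a dummy packet carrying the given PID."""
    # Byte 3 = 0x10: payload only, continuity counter 0.
    header = bytes([SYNC_BYTE, (pid >> 8) & 0x1F, pid & 0xFF, 0x10])
    return header + bytes([fill]) * (TS_PACKET_SIZE - len(header))


if __name__ == "__main__":
    stream = make_packet(0x100) + make_packet(0x1FFF) + make_packet(0x100)
    kept = list(filter_by_pid(stream, [0x100]))
    print(len(kept), [hex(packet_pid(p)) for p in kept])
```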

Outputs from NX-2700 drive a TV monitor, a VCR, an audio power-amplifier, 
and/or an audio headphone. The on-chip data and control flow, for an example 
DTV application, is shown in Figure 3. 




Fig. 3. Data & control flow for example DTV application 



7 Conclusions 

NX-2700 is the second generation of an architectural family of programmable
multimedia processors from Philips Semiconductors. The DTV market is still
evolving throughout the world and so there is a clear need for a programmable
DTV processor that will allow manufacturers to not only quickly develop ATSC
television sets, set-top boxes, and PC-TVs, but also add new features and
support emerging services such as program guides, interactive advertising, and
video telephony. NX-2700 provides all the above capabilities and can also act
as an analog-, cable-, or ISDN-modem for use in fully-interactive services such
as Web-browsing through the television set, video-on-demand, video
teleconferencing, and interactive online gaming. The chip executes various
digital-television applications and different media-processing tasks through a
mixture of hardware support and software control. NX-2700 borrows the CPU core,
the instruction and the data caches, and some peripheral units from the TM1100
[2]-[5] design; however, several new peripheral units have been added in order
to provide the key functionality for DTV applications.
Abstract. Advanced multimedia systems intrinsically have a high memory cost,
making the design of high performance, low power solutions a real challenge. 
Rather than spending most effort on implementation platform dependent 
optimization steps, we advocate a methodology and tool that involve C-level 
platform independent optimizations. This approach is applied to an MPEG-4 
video decoder, leading to high performance, reusable C code. When mapped on 
(embedded) processors, this allows for lower clock rates, enabling low power 
realizations. 



1 Introduction 

Novel multimedia compression systems, like the object-based MPEG-4 standard [1], 
offer an interactive and user-friendly representation of information. However, the 
compact representation of audio, video and data comes with the cost of complex and 
data intensive algorithms. Increasingly, these new systems are also specified in 
software, next to the traditional paper specification. For MPEG-4, this reference code 
in C consists of several hundred thousands of lines of code spread over many files. 
Realizing a cost-efficient implementation from such a specification is a real design 
challenge. 

Additional difficulties, like late specification modifications and ever-changing 
market requirements, can require changing the implementation target. Moreover, the 
design has to be completed within the right time-to-market. Typically, 
hardware/software partitioning is one of the first steps in the design process, followed 
by platform dependent optimizations. In contrast, we describe the application of a 
high level, platform independent methodology, with the support of the ATOMIUM 
tool [2]. This approach allows a late choice of the target platform and provides more 
flexibility to deal with the problems described above. 

This paper first briefly summarizes the proposed optimization methodology and 
then explains the functionality of the ATOMIUM framework, which provides means 
to deal with the code complexity of modern multimedia systems and to support the 
optimizations. Subsequently, the design of an MPEG-4 natural visual (video) decoder 
illustrates the use of ATOMIUM and the impact of the platform independent 
optimizations on the memory complexity. Finally, we measure the performance 
increase of the optimized decoder on a PC platform and indicate the relation between 
the reduction of memory accesses and the resulting speed up factor. 
2 C-Level Design

Recent multimedia applications are almost by definition data dominated, i.e. the
number of data transfer and storage operations is at least of the same order of
magnitude as the number of arithmetic operations [3]. This has a dominant impact
on the efficiency of the system realization: mainly the performance for software
and the power and silicon real estate for hardware realizations.



2.1 DTSE Methodology 

We have previously presented a Data Transfer and Storage Exploration (DTSE) 
methodology that provides a systematic way of reducing the memory cost [4]. It 
consists of a platform independent and a platform dependent part. The first part of the 
DTSE transformations is carried out at a platform independent level. These 
optimizations are hence not affected by possible changes in the implementation target 
and the resulting, optimized code (typically C code) is reusable. The target platform is 
chosen before the second, platform dependent part. This means that the outcome of 
the first design steps can be considered as reusable C-level IP (Intellectual Property). 
We show the results of the platform independent steps applied to the MPEG-4 video 
decoder. 



2.2 ATOMIUM Tool 

The huge C code complexity of multimedia systems makes the application of DTSE 
without additional help tedious and error-prone (see Section 0). To tackle this design 
bottleneck, the C-in-C-out ATOMIUM framework is being developed [2]. 

This framework consists of a scalable set of kernels and tools providing 
functionality for advanced data transfer and storage analysis, pruning and source-to- 
source code transformations. This paper focuses on the application of the first two 
items. 

Using ATOMIUM in a design involves three steps: instrumenting the program,
generating instrumentation data and postprocessing this data.



Instrumentation. The input C files, together with ATOMIUM specific include files,
are parsed and analyzed by ATOMIUM, resulting in C++ output files. These files have
the same input/output behavior as the original files, but also include additional
instrumentation code. Compilation with a regular C++ compiler and linking with the
ATOMIUM run time library creates an executable as shown in Fig. 1.

Generation of Instrumentation Data. Running the previously generated executable
with the (normal) input stimuli produces additional instrumentation data next to the
normal output (Fig. 2).

Postprocessing. The instrumentation data is then used for memory analysis and code 
pruning (see next sections). 
Fig. 1. Instrumenting the code with ATOMIUM prepares memory analysis or code pruning




Fig. 2. Running instrumented code produces instrumentation data next to the normal output 



3 MPEG-4 Natural Visual Decoder 

MPEG-4 can be considered as the first true multimedia standard. It describes a scene 
as a composition of synthetic or natural audiovisual objects: audio, video and 
graphics. These objects are coded separately using the most efficient compression 
tool. 

A specific device will only need a subset of the MPEG-4 tools to fulfill the needs
of the application. A profile in MPEG-4 is the definition of such a subset. A level
restricts the performance criteria, like the computational complexity of the
profile tool set [1], [5].

The MPEG-4 standard is divided in several parts: audio, systems, visual, etc. Next 
to the “classical” video objects, called natural visual objects, synthetic visual objects 
(such as facial animation) are distinguished. The MPEG-4 (natural visual) video 
decoder is a block-based algorithm exploiting temporal and spatial redundancy in 
subsequent frames. An MPEG-4 Visual Object Plane (VOP) is a time instance of a
visual object (i.e. frame). A decompressed VOP is represented as a group of
MacroBlocks (MBs). Each MB contains six blocks of 8 x 8 pixels: 4 luminance (Y), 1
chrominance red (Cr) and 1 chrominance blue (Cb) blocks. Fig. 3 shows a simple
profile decoder, supporting rectangular I and P VOPs. An I VOP or intra coded VOP
contains only independent texture information, decoded separately by an inverse
quantization and IDCT scheme. A P VOP or predictive coded VOP is coded using
motion compensated prediction from the previous P or I VOP. Reconstructing a P
VOP implies adding a motion compensated VOP and a texture decoded error VOP.
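The macroblock structure described above fixes how many MBs and samples a rectangular VOP contains. The sketch below assumes that the four 8 x 8 luminance blocks of an MB cover a 16 x 16 pixel area, which follows from the block counts in the text but is stated here as an assumption, and uses the standard CIF (352 x 288) and QCIF (176 x 144) picture sizes. The function names are hypothetical.

```python
import math

BLOCK_SIZE = 8        # blocks are 8 x 8 samples
MB_SIZE = 16          # assumption: 4 luminance blocks cover 16 x 16 pixels
BLOCKS_PER_MB = {"Y": 4, "Cr": 1, "Cb": 1}


def macroblocks_per_vop(width, height):
    """Number of macroblocks needed to cover a rectangular VOP."""
    return math.ceil(width / MB_SIZE) * math.ceil(height / MB_SIZE)


def samples_per_macroblock():
    """Samples stored per macroblock over all six blocks."""
    return sum(BLOCKS_PER_MB.values()) * BLOCK_SIZE * BLOCK_SIZE


if __name__ == "__main__":
    for name, (w, h) in {"QCIF": (176, 144), "CIF": (352, 288)}.items():
        mbs = macroblocks_per_vop(w, h)
        print(name, mbs, "MBs,", mbs * samples_per_macroblock(), "samples")
```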




Fig. 3. MPEG-4 simple profile natural visual decoding 



Next to a paper description, the MPEG-4 encoding or decoding functionality is
also completely specified through Verification Models (VM), normative reference
code implementing an MPEG-4 subpart (audio, visual, etc). This software, written in
C, is the reference for the encoding and decoding tools of that part of the standard.



4 Pruning 

The VM software used as input for this paper is the FDIS (Final Draft International
Standard) natural visual part [6]. Having working code at the start of the design
process can avoid the tedious task of implementing a system from scratch.
Unfortunately, the software specification contains many different coding styles and
is often of varying quality.

Moreover, the VM has to contain all the functionality, resulting in oversized C code
distributed over many files. Table 1 lists the code size of the video decoder only: 93
files (.h and .c source code files) containing 52928 lines (without counting the
comment lines).

Table 1. ATOMIUM pruning reduces the code size by a factor of 2.5. This allows manual
code rearrangement that further reduces the code complexity

Code version    Number of files    Number of lines    Reduction
FDIS            93                 52928              -
Pruned          26                 21340              2.5
Optimized       20                 10221              5.2
Fig. 4. ATOMIUM pruning extracts the required functionality from the source code based on
the instrumentation data of the input stimuli

A necessary first step in the design is extracting the part of the reference code 
corresponding to the desired MPEG-4 functionality of the given profile and level. 
ATOMIUM pruning, shown in Fig. 4, is used for this error-prone and tedious task. 
The tool identifies functions that are never used in the code (static pruning) and 
functions that are never called according to the input stimuli and parameters used to 
produce the instrumentation data (dynamic pruning). Consequently, ATOMIUM 
removes these functions and their calls. This implies careful selection of the set of 
input stimuli, which has to exercise all the required functionality. 
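The distinction between static and dynamic pruning can be sketched on a toy call graph. This is an illustration of the idea only and makes no claim about how ATOMIUM is implemented: static pruning is modeled as reachability from the entry function, dynamic pruning as the set of reachable functions that the instrumented run never called. The call graph and the function names in it are hypothetical.

```python
def statically_reachable(call_graph, entry="main"):
    """Functions reachable from `entry` in the static call graph."""
    reachable, stack = set(), [entry]
    while stack:
        function = stack.pop()
        if function in reachable:
            continue
        reachable.add(function)
        stack.extend(call_graph.get(function, ()))
    return reachable


def prune(call_graph, called_at_runtime, entry="main"):
    """Return (kept, removed by static pruning, removed by dynamic pruning)."""
    all_functions = set(call_graph)
    for callees in call_graph.values():
        all_functions.update(callees)
    live = statically_reachable(call_graph, entry)
    kept = live & set(called_at_runtime)
    return kept, all_functions - live, live - kept


if __name__ == "__main__":
    graph = {  # hypothetical decoder call graph
        "main": ["decode_vop"],
        "decode_vop": ["idct", "decode_shape"],
        "encode_vop": ["idct"],   # never referenced from main
    }
    observed = {"main", "decode_vop", "idct"}  # from instrumentation data
    print(prune(graph, observed))
```

The dynamic part depends entirely on the observed set, which is why the input stimuli have to exercise all the required functionality.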

Applying automatic pruning with a testbench covering the MPEG-4 simple profile 
natural visual tools reduced the code to 40 % of its original size (2.5 x reduction, see 
Table 1). From this point, further manual code reorganization and rewriting is feasible 
and shrinks the number of lines to 19 % of the original (5.2 x reduction). This last 
reduction is obtained by flattening the hierarchical function structure and because the 
memory optimizations allow further simplification of the required functionality. 
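The reduction factors and percentages quoted above follow directly from the line counts in Table 1, as the short check below shows (the dictionary and function names are only for illustration).

```python
LINES = {"FDIS": 52928, "Pruned": 21340, "Optimized": 10221}


def reduction_factor(version):
    """Code size of the FDIS reference divided by that of `version`."""
    return LINES["FDIS"] / LINES[version]


def remaining_percent(version):
    """Share of the original FDIS line count that is left, in percent."""
    return 100.0 * LINES[version] / LINES["FDIS"]


if __name__ == "__main__":
    for version in ("Pruned", "Optimized"):
        print(version,
              round(reduction_factor(version), 1),
              round(remaining_percent(version)))
```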



5 Memory Analysis 

The C-level design approach requires an analysis of the data transfer and storage
characteristics, initially for an early detection of possible implementation
bottlenecks, subsequently to measure the effects of the optimizations.
Traditionally, designers manually insert counter-based mechanisms. This is a valid,
but time-consuming and error-prone approach. Profilers offer an alternative, but
internally use a flattened memory model and, moreover, produce machine dependent
results [7].
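The manual, counter-based approach mentioned above can be pictured as wrapping each array so that every read and write increments a counter. The sketch below shows that idea in a few lines; it is not how ATOMIUM instruments code, and the class, the `clone` function and the array names are hypothetical.

```python
from collections import Counter


class CountedArray:
    """Array wrapper that counts element reads and writes per array name."""

    def __init__(self, name, size, counters):
        self._name = name
        self._data = [0] * size
        self._counters = counters

    def __getitem__(self, index):
        self._counters[(self._name, "read")] += 1
        return self._data[index]

    def __setitem__(self, index, value):
        self._counters[(self._name, "write")] += 1
        self._data[index] = value


def clone(src, dst, length):
    """Element-wise copy, similar in spirit to duplicating a VOP buffer."""
    for i in range(length):
        dst[i] = src[i]


if __name__ == "__main__":
    counters = Counter()
    current = CountedArray("current_vop", 64, counters)
    previous = CountedArray("previous_vop", 64, counters)
    clone(current, previous, 64)
    for key, count in sorted(counters.items()):
        print(key, count)
```

Even this small example shows why the manual route is tedious: every array and every access path has to be wrapped by hand, and the counts must be kept apart per array or per function.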

Postprocessing the instrumentation data with the ATOMIUM reporter generates an 
instrumentation database in a selectable output format. Using HTML as output offers 
an efficient and intuitive way of navigating through the memory access reports. The 
analysis results can be produced on array basis or on function basis. Table 2 lists the 
characteristics of the video bitstreams used as input stimuli for the creation of the 
instrumentation data. Akiyo is a typical head and shoulder sequence with little motion, 
Foreman is a medium motion sequence, whereas Calendar and Mobile is a highly 
complex sequence. When enabling rate control, the MPEG-4 encoder sometimes 
skips frames to obtain the specified bitrate. This explains the difference between the 
number of displayed VOPs and the number of coded VOPs (when the encoder 
skipped a frame, the decoder displays the previous one). The results listed are for the
MPEG-4 simple profile for CIF (352 x 288) and QCIF (176 x 144) image sizes.



Table 2. Characteristics of the testbench video sequences

Test case                      Number     Rate       Number of     Bitrate
                               of VOPs    control    coded VOPs    (kbps)
1. Akiyo QCIF                  81         yes        71            53
2. Foreman QCIF 1              81         none       81            95
3. Foreman QCIF 2              81         none       81            96
4. Foreman CIF 1               81         yes        62            104
5. Calendar and Mobile QCIF    81         none       81            1163
6. Foreman CIF 2               81         yes        58            104
7. Foreman QCIF 3              81         yes        81            51
8. Foreman CIF 3               101        none       101           274
9. Foreman CIF 4               101        none       101           465
10. Foreman CIF 5              101        none       101           764


Analysis of the access reports of the automatically pruned code allows early 
identification of bottlenecks. Table 3 lists the most memory intensive functions 
together with the relative execution time spent in this function for the Foreman CIF 3 
test case. The timing results are obtained with Quantify [8] on a Pentium II 350 MHz 
PC (intentionally a low-end model since eventually embedded systems are targeted). 
As expected, memory bottlenecks popping up at this platform independent level also 
turn out to consume much time on the PC platform. The following list explains the 
behavior of the functions in Table 3: 

• VopMotionCompensate: Picks the MB positioned by the motion vectors from the
previous reconstructed VOP. In case of half-pel motion vectors, interpolation is
required.

• BlockIDCT: Inverse Discrete Cosine Transform of an 8 x 8 block.

• VopTextureUpdate: Add the motion compensated and texture VOP.

• CloneVop: Copies data of current to previous reconstructed VOP by duplicating
it.

• VopPadding: Add a border to previous reconstructed VOP to allow motion 
vectors to point out of the VOP. 

• WriteOutputImage: Write the previous reconstructed VOP (without border) to the 
output files. 

Only the IDCT is a computationally intensive function; all the others mainly involve
data transfer and storage. The motion compensation and block IDCT together cause
more than 40 % of the total number of memory accesses, making them the main
bottlenecks. It is hence logical to focus on these functions during the memory
optimizations (i.e. when reducing the number of accesses).

The platform independent DTSE optimizations consist of global data flow, global 
loop and control flow transformations. These transformations reduce the number of 
memory accesses, improve the locality of the array accesses and decrease the amount 
of required memory [3], [4]. The listed results (Table 4) only include a part of the 
possible control and data flow and loop transformations. The reduction factor varies
from 4.6 to 10.8 as the effect of some of the optimizations relies on the content of the
bitstream.

Table 3. Motion compensation and the IDCT are the memory bottlenecks of the decoder. This
analysis was done using the Foreman CIF 3 test case



Function name         # accesses/frame         Relative # accesses (%)  Relative time (%)
                      (10^6 accesses/frame)
VopMotionCompensate   3.9                      25.4                     14.6
BlockIDCT             2.8                      18.0                     17.7
VopTextureUpdate      1.7                      10.7                     5.4
CloneVop              1.2                      7.5                      5.0
VopPadding            1.1                      7.0                      6.4
WriteOutputImage      1.0                      6.2                      27.3
Subtotal              11.6                     74.7                     76.3
Total                 15.5                     100.0                    100.0



Table 4. The memory optimization result varies from a factor 4.6 to 10.8 



Test case            # accesses/frame, pruned  # accesses/frame, optimized  Reduction factor
                     (10^6 accesses/frame)     (10^6 accesses/frame)
1. Akiyo QCIF        2.8                       0.3                          10.8
2. Foreman QCIF 1    4.1                       0.6                          7.1
3. Foreman QCIF 2    3.9                       0.6                          6.5
4. Foreman CIF 1     11.4                      1.4                          8.1
5. Cal & Mob QCIF    4.8                       1.0                          4.6
6. Foreman CIF 2     10.8                      1.3                          8.0
7. Foreman QCIF 3    3.8                       0.5                          7.2
8. Foreman CIF 3     15.5                      2.2                          7.2
9. Foreman CIF 4     16.3                      2.5                          6.5
10. Foreman CIF 5    17.0                      2.8                          6.0



6 Evaluation of the Optimizations 



The implemented memory optimizations have a positive effect at the platform
dependent level, both for hardware and software. On the HW side, the gain is
evaluated by the reduction in power consumption; on the SW side, the speed up of
the code determines the effectiveness. This speed up can then be used to lower the
clock speed, hence reducing the power consumption.

The main part of the power consumption in data dominated applications is due to
the memory [4]. The ATOMIUM instrumentation data, together with the number of
words and the width (in bits) of the used memory, provide the necessary input to
calculate a simple estimate of the power consumption:

$$P_{Tr} = E_{Tr} \times \frac{\#\mathit{Transfers}}{\mathit{Second}} \qquad (1)$$

$$E_{Tr} = f(\#\mathit{words}, \#\mathit{bits}) \qquad (2)$$



Doing this calculation for every memory block yields an estimate of the total power
dissipation. Reducing the required memory size allows the choice of
memory blocks with a lower energy per transfer. Combining this with a lower
number of accesses (Table 4) leads to a lower overall power consumption of the
optimized decoder. We have previously demonstrated this approach for HW by
designing the OZONE, an ASIC for wavelet-based MPEG-4 visual texture
compression [9].
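
The per-block calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `energy_per_transfer` is a hypothetical stand-in for f(#words, #bits), whose real form depends on the memory library used, and the names `MemoryBlock`, `transfer_power` and `total_power` are ours, not part of ATOMIUM.

```python
from dataclasses import dataclass


@dataclass
class MemoryBlock:
    words: int              # number of words stored in the block
    bits: int               # word width in bits
    transfers_per_s: float  # accesses per second, e.g. from instrumentation data


def energy_per_transfer(words: int, bits: int) -> float:
    """Hypothetical placeholder for E_Tr = f(#words, #bits), in joules.

    Assumed to grow with memory size and word width; not a real memory model.
    """
    return 1e-12 * bits * (words ** 0.5)


def transfer_power(block: MemoryBlock) -> float:
    """P_Tr = E_Tr * (#transfers / second) for one memory block."""
    return energy_per_transfer(block.words, block.bits) * block.transfers_per_s


def total_power(blocks) -> float:
    """Sum the transfer power over all memory blocks."""
    return sum(transfer_power(b) for b in blocks)


if __name__ == "__main__":
    before = [MemoryBlock(words=65536, bits=8, transfers_per_s=4e6)]
    after = [MemoryBlock(words=1024, bits=8, transfers_per_s=1e6)]
    print(total_power(before), total_power(after))
```

With any energy model that increases with memory size, a smaller block combined with fewer accesses gives a lower estimate, which is the effect the text describes.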



Table 5. The speed up factor of the video decoder varies between 6.0 and 19.5 



Test case            Pruned (fps)  Optimized float IDCT (fps)  Speed up  Optimized integer IDCT (fps)  Speed up
1. Akiyo QCIF        27.3          235.3                       8.6       533.3                         19.5
2. Foreman QCIF 1    16.5          95.2                        5.8       187.9                         11.4
3. Foreman QCIF 2    16.7          92.9                        5.6       176.1                         10.5
4. Foreman CIF 1     6.0           73.3                        12.1      85.9                          14.2
5. Cal & Mob QCIF    13.3          28.9                        2.2       80.1                          6.0
6. Foreman CIF 2     6.4           76.6                        12.0      89.9                          14.1
7. Foreman QCIF 3    18.0          147.0                       8.2       213.2                         11.8
8. Foreman CIF 3     4.3           30.4                        7.1       52.1                          12.1
9. Foreman CIF 4     4.0           21.3                        5.3       43.1                          10.8
10. Foreman CIF 5    3.8           15.7                        4.1       36.5                          9.7



Note that the VM saves the decoded video sequence to disk to allow for an 
assessment of the compression results. In real life applications, the decoded results are 
written to the video memory. To avoid this inconsistency, the speed up is measured 
here without writing to disk. 

The speed improvement of the MPEG-4 video decoder due to the platform
independent memory optimizations is listed in Table 5. The gain varies between 2.2
and 12.0. The number of cache hits is a crucial factor for the performance [10].
Lowering the amount of memory and the number of accesses and improving the data
locality increase the probability of cache hits. This gain comes in addition to the
(well-known) gain achieved by replacing the floating point IDCT by a computationally
more efficient integer version (resulting in an overall speed up factor between 6.0 and
19.5). This, together with the transformed control flow graph, explains the speed
increase of Table 5. Comparing these rates, measured on a Pentium II 350 MHz NT
PC, with state-of-the-art results such as those presented in [11] and [12] is not
straightforward. The performance depends on the platform and on the coding
characteristics of the input sequences: the rate control method, the compressed
bitrate, the quantization level, etc. Unfortunately, [11] and [12] provide insufficient
details about their testbench to make a detailed comparison, but globally our results
achieve the same performance without the use of platform dependent optimizations.

The decrease of the total number of array accesses can also be used as an
indication of the speed up, without the need to do the actual mapping on a platform
(see also Fig. 5). Of course, this only holds as long as the application remains
data dominated. A more precise estimate can be obtained by combining the decrease
in the number of accesses of a certain function with a factor indicating its data
dominance level. Calendar and Mobile illustrates the case where the main reduction
of accesses is in the data dominated functions and only a small part of the reduction
is in the computation dominated IDCT functionality (i.e. the application is no longer
data dominated). The speed up factor using the floating point IDCT is hence smaller
than the access reduction factor. Foreman CIF 1 and 2 illustrate the opposite case.
Here, the main part of the reduction of accesses is due to the IDCT function and
hence the speed up is higher than the reduction of accesses. Consequently, the
replacement of the floating point IDCT by an integer one gives a proportionally larger
speed improvement for Calendar and Mobile and a smaller one for Foreman.
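
The comparison between access reduction and speed up can be checked directly against the numbers in Tables 4 and 5. The short Python sketch below recomputes the floating point IDCT speed up from the frame rates and compares it with the reduction factor for three of the test cases; the helper name `float_speedup` and the choice of test cases are ours.

```python
# (reduction factor from Table 4, pruned fps, optimized float-IDCT fps from Table 5)
CASES = {
    "Akiyo QCIF": (10.8, 27.3, 235.3),
    "Cal & Mob QCIF": (4.6, 13.3, 28.9),
    "Foreman CIF 1": (8.1, 6.0, 73.3),
}


def float_speedup(pruned_fps: float, optimized_fps: float) -> float:
    """Speed up factor as the ratio of optimized to pruned frame rate."""
    return optimized_fps / pruned_fps


if __name__ == "__main__":
    for name, (reduction, pruned, optimized) in CASES.items():
        speedup = float_speedup(pruned, optimized)
        trend = "above" if speedup > reduction else "below"
        print(f"{name}: reduction {reduction:.1f}, "
              f"speed up {speedup:.1f} ({trend} the reduction factor)")
```

For Calendar and Mobile the recomputed speed up stays below the reduction factor, while for Foreman CIF 1 it exceeds it, in line with the discussion above.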



[Figure: bar chart "Effect of the memory optimizations towards SW". Series: Reduction, Speedup Float, Speedup Int. Horizontal axis: Test Number (the ten test cases of Table 2). Vertical axis: 0.0 to 20.0.]

Fig. 5. The reduction of the number of accesses is an indication for the speed up factor



7 Conclusions 

The MPEG-4 video decoder, with its highly complex software specification, has large
data transfer and storage requirements. We have illustrated the use of the ATOMIUM
tool for the automatic pruning and the advanced data transfer and storage analysis of
the MPEG-4 video decoder specification. ATOMIUM gives designers the necessary
support to deal with complex analysis and platform independent optimizations at the
C-level. The effect of these optimizations is a reduction of the memory accesses by
a factor of 4.6 to 10.8. This optimized platform independent code results in a speed up
between 6.0 and 19.5 when compiled on a PC platform. This performance increase
creates the possibility to lower the clock frequency and hence to reduce the power
consumption on (embedded) processors.

C-level, platform independent optimization is hence an approach that allows the
optimization effort to be re-used for different target platforms. A prediction of the
resulting performance gain on a specific platform, taking into account the degree and
distribution of the data dominance of the application, is possible without the effort of
an actual implementation.

Abstract. Exploitation of data re-use in combination with the use of a
custom memory hierarchy that exploits the temporal locality of data
accesses may introduce significant power savings, especially for
data-intensive applications. The effect of the data-reuse decisions on the power
dissipation, but also on the area and performance of multimedia applications
realized on multiple embedded cores, is explored. The interaction between
the data-reuse decisions and the selection of a certain data-memory
architecture model is also studied. As demonstrator, a widely-used video
processing algorithmic kernel, namely the full search motion estimation
kernel, is used. Experimental results prove that improvements in both
power and performance can be acquired when the right combination of
data memory architecture model and data-reuse transformation is
selected.



1 Introduction 

The number of multimedia systems used for exchanging information is rapidly
increasing nowadays. Portable multimedia applications, such as video phones,
multimedia terminals and video cameras, are available. Portability as well as
packaging, cooling and reliability issues have made power consumption an
important design consideration [1]. For this reason there is a great need for power
optimization strategies, especially at the higher design levels, where the most
significant savings can be achieved.

Additionally, these applications also require increased processing power for
manipulating large amounts of data in real time. To meet this demand, two
general implementation approaches exist. The first is to use custom hardware
dedicated processors. This solution leads to smaller area and power consumption.
However, it lacks flexibility, since only a specific algorithm can be executed by
the system. The second solution is to use a number of embedded instruction set
processors. This solution requires increased area and power in comparison to the
first solution. However, it offers increased flexibility and, above all, meets the
time-to-market constraints more easily. In both cases, to meet the real time requirements,
the initial application description must be partitioned and assigned to a number
of processing elements, which has to be done in a power efficient way. For
multimedia applications realized in custom-processor platforms, the dominant
factor in power consumption is the one related to data storage and transfer [2].
In programmable platforms, though, the power consumed for instruction storage
and transfers limits the dominant role of the power related to data storage and
transfer [4].

Related work that combines partitioning of the algorithm with techniques
for reducing the memory-related power cost is relatively scarce [2] [3] [4] [5]. More
specifically, a systematic methodology for the reduction of memory power
consumption is presented in [2] [3]. According to this methodology, power optimizing
transformations (such as data-reuse) are applied to the high level description of
the application prior to the partitioning step. These transformations mainly target
the reduction of the power due to data storage and transfer. Although the
efficiency of this methodology has been proved for custom hardware architectures
[2] and for commercially available multimedia processors (e.g. Trimedia) [3], it
does not tackle the case where embedded multiprocessor architectures
are used. The latter point has been stressed in [4], where the data-reuse
exploration proposed in [6] has been applied to uni-processor embedded
architectures. The experimental results of [4] indicated that the reduction of the
data memory-related power does not always come with a reduction of the total
power budget for such architectures. Finally, a partitioning approach attempting
to improve memory utilization is presented in [5]. However, this approach is limited
by the two-level memory hierarchy, does not explore the effect of the high-level
power optimizing transformations, and its applicability is limited to a class of
algorithms expressed in Weak Single Assignment Code (WSAC) form. Clearly,
previous research work has not explored the effect of the high level
transformations on power, area, and performance for the case of multiprocessor embedded
architectures. In such architectures, a decision that heavily affects power, area
and performance is the one related to the data memory architecture model (i.e.
shared, distributed, shared-distributed) to be followed.

The motivation of this work is to investigate the dependencies between the
decision of adopting a certain data memory architecture model and the high-level
power optimizing transformations. The intuition is that these two high-level
design steps, which heavily influence all design parameters, are not orthogonal to
each other. Consequently, in this paper we apply all possible data-reuse
transformations [6] to a real-life application, assuming an LSGP partitioning scheme
[11] and three different data memory architecture models, namely Distributed,
Shared, and Shared-Distributed. For all the data-memory architectures, the
effect of the transformations on performance, area and power consumption is
evaluated. The experimental results prove that the same data-reuse transformations
do not have a similar effect on power and performance when applied to different
data-memory architecture models. Thus, the claim that the application of these
transformations in the first step can optimize power and/or performance,
regardless of the decisions related to the data memory architecture that must follow, is proved
to be weak. Furthermore, the comparative study concerning power, performance
and area of the three architectures and all the data reuse transformations
indicates that an effective solution can be acquired from the right combination of
data memory architecture model and data-reuse transformation. Finally, once
more, the critical influence of the instruction power consumption on the total
power budget is proved.

2 Target Architectures 

We are working on multiple processor architectures, each of which has its own single
on-chip instruction memory. The size of the instruction memory is strongly
dependent on the code size executed by a processor. We name this scheme
application specific instruction memory (ASIM). The instruction memories of different
processors may have different sizes. Concerning the data-memory organization, an
application specific data memory hierarchy (ASDMH) is assumed [2] [7]. Since
we focus on parallel processing architectures, we explore ASDMH in combination
with three well-established data-memory architecture models: 1) the distributed
data-memory architecture (DMA), 2) the shared data-memory architecture (SMA), and
3) the shared-distributed data-memory architecture (SDMA). For all the data-memory
architecture models, a shared background (probably off-chip) memory module
is assumed. Thus, in all cases special care must be taken during the scheduling
of accesses to this memory, to avoid violating data dependencies and to keep
the number of memory ports as small as possible, in order to keep the power
cost per access as small as possible. With DMA, a separate data-memory
hierarchy exists for each processor (Fig. 1). In this way all memory modules of
the memory hierarchy are single ported, but an area overhead is possible when
a large amount of common data is to be processed by the N processors. The
second data-memory architecture model (i.e. SMA) implies a common hierarchy
of memory levels for the N processors (Fig. 2). Since, in the data-dominated
programmable parallel processing domain, it is very difficult and very inefficient
in terms of performance to schedule all memory accesses sequentially, we assume that
the number of ports for each memory block equals the maximum number of
parallel accesses to it. Finally, SDMA is a combination of the above two models,
where the data common to the N processors are placed in a shared memory
hierarchy, while a separate data memory hierarchy also exists for the lowest levels
of the hierarchy (Fig. 3). For experimental purposes, we have considered target
models with N=2, without any restriction on the number of memory hierarchy levels.



3 Data Reuse Transformations 

The fact that in multimedia applications the power related to memory transfers
is the dominant factor in the total power cost motivates us to find an efficient method
to reduce them. This goal can be achieved by efficient manipulation of
memory data transfers. For that purpose, we performed an exhaustive data reuse
exploration of the application's data. Employing data reuse transformations, we
determine the data sets which are heavily re-used in a short
period of time. The re-used data can be stored in smaller on-chip memories,
which require less power per access. In this way, redundant accesses to large
off-chip memories are transferred on chip, reducing the power consumption related to
data transfers. Of course, data reuse exploration has to decide which data sets
are appropriate to be placed in a separate memory. Otherwise, we would need many
different memories, one for each data set, resulting in a significant area penalty.

Fig. 1. The distributed memory data-memory architecture model

Fig. 2. The shared memory data-memory architecture model

Fig. 3. The shared-distributed data-memory architecture model



Since our target architecture consists of programmable processors, we must
take into consideration the power dissipation due to instruction fetching.
Previous work [4] indicates that this power component is a significant part of
the total system power and thus should not be ignored. It depends on
both the number of executed instructions and the size of the application code.
In particular, the number of executed instructions determines how many times the
instruction memory is accessed, while the code size determines the memory size.
The cost function used for our data reuse exploration on all target architectures
is evaluated in terms of power, performance, and area, taking into account both
data and instruction memories. The cost function for power is:

$$\mathit{Power\_cost} = \sum_{i=1}^{N} \mathit{power\_cost}_i \qquad (1)$$

where N is the number of processors and the i-th power estimate, power_cost_i,
is:

$$\mathit{power\_cost}_i = \sum_{c \in CT} \left[ P_r(\mathit{word\_length}(c), \#\mathit{words}(c), f_{read}(c), \#\mathit{ports}(c)) + P_w(\mathit{word\_length}(c), \#\mathit{words}(c), f_{write}(c), \#\mathit{ports}(c)) \right] + P_i(\mathit{instr\_word\_length}, \mathit{code\_size}, f) \qquad (2)$$

where c is a member of the copy tree (CT) [6], and P_r(·), P_w(·), and P_i(·) are the
power consumption estimates for a read operation, a write operation, and an instruction
fetch, respectively. For memory power consumption estimation we use the models
reported in [2] and [8].

The total delay cost function is obtained by:

$$\mathit{Delay\_cost} = \max_{i} \{ \#\mathit{cycles\_processor}_i \} \qquad (3)$$

where #cycles_processor_i denotes the number of executed cycles of the i-th
processor (i = 1, 2, ..., N). The maximum number of cycles thus determines the
performance of the system. In order to estimate the performance of a particular
application, we use the number of executed cycles resulting from the simulation
environment of the considered processor core. Here, for experimental reasons, we
use the ARMulator [12].

High level estimation implies that a designer should decide which possible
solution of a certain problem is the most appropriate. For that purpose, we
use the power x delay product as a measure. This measure can be considered
a generalization of the similar concept from circuit level design and allows the
designer to perform trade-offs among several possible implementations. That is,
the power efficiency of an architecture is measured by:

$$\mathit{Power\_eff\_arch} = \mathit{Power\_cost} \times \mathit{Delay\_cost} \qquad (4)$$

The corresponding area cost function is:

$$\mathit{Area\_cost} = \sum_{i=1}^{N} \mathit{area\_cost}_i \qquad (5)$$

with

$$\mathit{area\_cost}_i = \sum_{c \in CT} \mathit{Area}(\mathit{word\_length}(c), \#\mathit{words}(c), \#\mathit{ports}(c)) + \mathit{Area}(\mathit{instr\_word\_length}, \mathit{code\_size}) \qquad (6)$$

For the area occupied by the memories, Mulder's model is used [9]. The cost
function of the entire system is given by:

$$\mathit{Cost} = a \cdot \mathit{Power\_eff\_arch} + b \cdot \mathit{Area\_cost} \qquad (7)$$

where a and b are weighting factors for area/energy trade-offs.
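
The way the cost functions (1) to (7) combine can be sketched in Python as follows. This is only an illustration under stated assumptions: the power and area models are passed in as callables because the actual models of [2], [8] and [9] are not reproduced here, and all class and function names (`Memory`, `Processor`, `system_cost`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Memory:
    """One node c of the copy tree (CT)."""
    word_length: int
    words: int
    f_read: float
    f_write: float
    ports: int


@dataclass
class Processor:
    copy_tree: List[Memory]
    instr_word_length: int
    code_size: int
    f_instr: float  # instruction fetch frequency
    cycles: int     # executed cycles, e.g. from a core simulator


def system_cost(procs: List[Processor],
                p_r: Callable[[Memory], float],
                p_w: Callable[[Memory], float],
                p_i: Callable[[Processor], float],
                area_mem: Callable[[Memory], float],
                area_instr: Callable[[Processor], float],
                a: float = 1.0, b: float = 1.0) -> float:
    # Eqs. (1)-(2): data read/write power over the copy tree plus instruction power
    power = sum(sum(p_r(c) + p_w(c) for c in p.copy_tree) + p_i(p) for p in procs)
    # Eq. (3): the slowest processor determines the delay
    delay = max(p.cycles for p in procs)
    # Eq. (4): power x delay product
    power_eff = power * delay
    # Eqs. (5)-(6): data memory area plus instruction memory area
    area = sum(sum(area_mem(c) for c in p.copy_tree) + area_instr(p) for p in procs)
    # Eq. (7): weighted sum
    return a * power_eff + b * area
```

Passing the models as parameters keeps the sketch independent of any particular memory library; the weights `a` and `b` play the role of the weighting factors in (7).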

4 Experimental Results-Comparative Study 

In this section, we perform an extensive comparative study of the relation between
data-reuse transformations and data-memory models, assuming that the application
is partitioned. We begin with the description of our test vehicle and its
partitioning scheme, and then provide the experimental results after the application
of the data-reuse transformations for all target architectures, in terms of
power, performance and area.

4.1 Demonstrator Application and Partitioning

Our demonstrator application is the full search motion estimation
algorithm [10]. This algorithm was chosen because it is used in a great
number of video processing applications. Our experiments were carried out using
the luminance component of frames in QCIF (144x176) format. The reference window
was selected to include 15x15 candidate blocks, while blocks of 16x16 pixels
were considered. The algorithm structure, described in Fig. 4, has
three double nested loops. A block of the current frame (outer loop) is compared
to a number of candidate blocks (middle loop). In the inner loop, a distortion
criterion is computed to perform the comparison.
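
The loop structure just described can be illustrated with a small, unoptimized Python sketch. Two points are assumptions on our part and not taken from the paper: the distortion criterion is taken to be the sum of absolute differences (SAD), and candidate blocks that fall partly outside the previous frame are simply skipped. The function names `sad` and `full_search` are hypothetical.

```python
def sad(cur, prev, bx, by, dx, dy, B):
    """SAD between block (bx, by) of cur and the candidate displaced by (dx, dy)
    in prev; returns None if the candidate leaves the frame (assumed policy)."""
    rows, cols = len(prev), len(prev[0])
    total = 0
    for k in range(B):          # inner double loop: pixels of the block
        for l in range(B):
            r, c = B * bx + dx + k, B * by + dy + l
            if r < 0 or r > rows - 1 or c < 0 or c > cols - 1:
                return None
            total += abs(cur[B * bx + k][B * by + l] - prev[r][c])
    return total


def full_search(cur, prev, B, p):
    """Return the best displacement (dx, dy) for every BxB block of cur."""
    rows, cols = len(cur), len(cur[0])
    vectors = {}
    for bx in range(rows // B):             # outer double loop: current blocks
        for by in range(cols // B):
            best = None
            for dx in range(-p, p + 1):     # middle double loop: candidates
                for dy in range(-p, p + 1):
                    d = sad(cur, prev, bx, by, dx, dy, B)
                    if d is not None and (best is None or d < best[0]):
                        best = (d, dx, dy)
            vectors[(bx, by)] = (best[1], best[2])
    return vectors
```

The zero displacement is always inside the frame, so every block receives a vector. The three double loops correspond to the outer, middle and inner loops of the description above.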

Partitioning was done using the LSGP technique [11]. By applying this
technique to a generalized for-loop structure, while assuming p partitions, the
partitioned algorithm takes the form shown in Fig. 5.



for(x = 0; x < N/B; x++)
  for(y = 0; y < M/B; y++)
    for(i = -p; i < p+1; i++)
      for(j = -p; j < p+1; j++)
        for(k = 0; k < B; k++)
          for(l = 0; l < B; l++)
            if((B*x+i+k) < 0 || (B*x+i+k) > N-1 || (B*y+j+l) < 0 || (B*y+j+l) > M-1)
              /* conditional statement for the pixel of candidate block */



Fig. 4. The full search motion estimation algorithm 



Do in parallel:
Begin
  for(x = 0; x < ...; x++) {sub-algorithm}
  ...
  for(x = ...; x < ...; x++) {sub-algorithm}
End



Fig. 5. The partitioned algorithm 



The "Do in parallel" semantics impose the parallel (concurrent) execution
of p nested loops (i.e. sub-algorithms). From the above code, it is apparent
that the outermost loop is broken into p partitions, each of which is mapped to
a processor. The p processors execute the same algorithmic structure for different
values of the loop index x, i.e. for different current blocks. Due to an inherent property
of the algorithm, a set of data is used by two consecutive sub-algorithms.
In other words, data from the (k-1)-th processor is also used by the k-th processor
(k = 1, 2, 3, ..., p). Our experiments were carried out assuming p = 2, meaning
two partitions. Due to the QCIF format (144x176), the loop index x has a range
of nine, i.e. the outermost index ranges from 0 to 8. The first and second
processors execute the algorithm in parallel, for loop index x ranging
from 0 to 4 and from 5 to 8, respectively. We examined the impact of
partitioning combined with 21 data reuse transformations on power, performance,
and area. These transformations were applied after the partitioning process was
finished, in accordance with the previous section. They involved the insertion of
memories for a line of current blocks (CB line), a current block (CB), a line of
candidate blocks (PB line), a candidate block (PB), a line of reference windows
(RW line) and a reference window (RW). These transformations were applied
for all three data-memory architecture models, taking into account each
architecture's characteristics. The copy tree [6] of the full search motion
estimation algorithm is identical for processors 1 and 2; in its figure, the dashed lines
show the memory levels. Each rectangle contains three labels, where the number
identifies the applied data reuse transformation associated with the memory
hierarchy level. The remaining two labels give the size of a PB or CB line
or block, RW line or reference window.
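
The split of the outermost loop into contiguous index ranges can be sketched as follows. The chunk size of ceil(n/p) iterations per processor is our assumption; it is chosen because it reproduces the ranges 0 to 4 and 5 to 8 quoted above for nine iterations and p = 2. The function name `lsgp_ranges` is hypothetical.

```python
import math


def lsgp_ranges(n_iterations: int, p: int):
    """Split range(n_iterations) into p contiguous chunks of at most
    ceil(n_iterations / p) iterations each (assumed partitioning rule)."""
    chunk = math.ceil(n_iterations / p)
    return [range(k * chunk, min((k + 1) * chunk, n_iterations)) for k in range(p)]


if __name__ == "__main__":
    # QCIF: 144 rows of luminance, 16x16 blocks, two processors
    for proc, r in enumerate(lsgp_ranges(144 // 16, 2), start=1):
        print(f"processor {proc}: x = {r.start} .. {r.stop - 1}")
```

Each returned range would be the iteration space of the outermost loop for one processor, with the inner loops left unchanged.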

4.2 Experimental Results 

Comparisons among the three target architectures, in terms of power,
performance, and area, are shown in Figs. 6, 8, and 9.

Fig. 6 provides comparison results of power consumption with respect to the
data-reuse transformations. The most power efficient design approach is the
combination of SDMA and data-reuse transformations 4, 5, 15, 19 and 20. In
contrast, almost all data-reuse transformations increase the total power when DMA
or SMA is assumed.

The effect of the data-reuse transformations on the power consumption of the data
memory is shown in Fig. 7. As can be seen, the largest effect is on SMA,
while the most efficient are the other two data memory architecture models.
Comparing Fig. 6 and 7, it is deduced that the power cost related to the instruction
memory contributes significantly to the total power budget, and in many
cases overturns the power savings acquired in the data memory. Thus, the power
component related to the instruction memory cannot be ignored during such high
level power exploration. Fig. 8 shows that with DMA and SMA the data-reuse
transformations barely affect performance, while with SDMA the
transformations have a more significant impact on performance. The greater variation in
performance when SDMA is assumed results from the size of the instruction
code related to control operations, which specify which memories of the hierarchy
should be accessed. However, it can be generally concluded that the
transformations have a similar effect on the performance for all data-memory architecture
models (i.e. a certain transformation positively/negatively affects performance for all
data-memory architecture models). Although this is true, the optimal
transformations in terms of performance are different for each data-memory
architecture model. Specifically, transformations 4, 5, 6, 18, 19 and 20 for SDMA,
and 6, 7, 8, 9, 13, 16, 17 and 18 for SMA and DMA, are the near-optimal or optimal
solutions in terms of performance.

Fig. 6. Comparison results for total power.

Fig. 7. The effect of data-reuse transformations on power of data memory.

In Fig. 9 the effect of the data-reuse transformations on area is illustrated. From
this it can be inferred that each transformation influences area in an almost
identical manner for all data-memory architecture models. It is also clear that all
transformations increase area, since they impose the addition of extra data
memory hierarchy levels. Moreover, for both DMA and SDMA the area cost is similar
for each data-reuse transformation. With SMA the area occupation is larger in
all cases. This is due to the fact that several memory modules are dual ported, to
be accessed in parallel by the processing elements. In contrast, with the other two
models most memory modules are single ported and thus occupy less area. As can be seen,
SDMA is the most area efficient, since with this data-memory architecture
model there are no memories in the hierarchy with duplicate data.

Fig. 8. Performance comparison results of the target architectures.

Fig. 9. Area comparison results of the target architectures.

Fig. 10. Comparison results of the target architectures with respect to power x delay
product.

In order to determine which combination of data-memory architecture model
and data-reuse transformation is the most efficient in terms of performance and
power, we plot the power x delay product (Fig. 10). We infer that there are several
possible solutions from which the designer can choose. These solutions are:
transformation 3 with SMA, transformations 15 and 17 with DMA, and
transformations 4, 5, 15, 19 and 20 with SDMA. If the area dimension is also taken
into account, the effective solutions are transformations 15 and 17 with DMA, and
4, 5, 15, 19 and 20 with SDMA.

5 Conclusions 

Data-reuse exploration for the partitioned version of a real life application and
for three alternative data-memory architecture models was performed.
Application specific data-memory hierarchy and instruction memory, as well as
embedded programmable processing elements, were assumed. The comparison results
prove that an effective solution, either in terms of power, or power and delay, or
power, delay and area, can be acquired from the right combination of data
memory architecture model and data-reuse transformation. Thus, in the parallel
processing domain for multimedia applications, the high-level design decision
of adopting a certain data-memory architecture model and the application of
high-level power optimizing transformations should be performed interactively
and not in a sequential way (regardless of the ordering), as prior research work
proposed.

Abstract. A design methodology for reversible logic circuits is presented.
Any boolean function can be built using the three fundamental
building blocks of Feynman. The implementation of these logic gates into
electronic circuitry is based on c-MOS technology and pass-transistor design.
We present a chip containing single Feynman gates, as well as an
application: a chip containing a fully reversible four-bit adder. We propose
a generalization of the Feynman gates: the reversible control gates.



1 Introduction 

Classical computing machines using irreversible logic gates unavoidably generate
heat. This is due to the fact that each loss of one bit of information is accompanied
by an increase of the environment’s entropy by an amount k log(2), where
k is Boltzmann’s constant. In turn this means that an amount of thermal energy
equal to kT log(2) is transferred to the environment, which has temperature T.
According to Landauer’s principle [1] [2], it is possible to construct a computer
that dissipates an arbitrarily small amount of heat. A necessary condition is that
no information is thrown away. Therefore, logical reversibility is a necessary
(although not sufficient) condition for physical reversibility.

It is widely known that an arbitrary boolean function can be implemented 
into logic using only NAND-gates. A NAND-gate has two binary inputs (say A 
and B) but only one binary output (say P), and therefore is logically irre- 
versible. Fredkin and Toffoli [3] have shown that a basic building block which 
is logically reversible, should have three binary inputs (say A, B, and C) and 
three binary outputs (say P, Q, and R). Feynman [4] [5] has proposed the use 
of three fundamental gates: 

- the NOT gate, 

- the CONTROLLED NOT gate, and 

- the CONTROLLED CONTROLLED NOT gate. 

See Table 1. Together they form a set of three building blocks with which we can
synthesize an arbitrary logic function. The NOT gate simply realizes P = NOT A.
The CONTROLLED NOT satisfies P = A, together with

If A = 0, then Q = B, else Q = NOT B . (1)

The CONTROLLED CONTROLLED NOT satisfies P = A, Q = B, together with

If A AND B = 0, then R = C, else R = NOT C . (2)

The logic expressions of the CONTROLLED NOT are equivalent to

P = A
Q = A XOR B ,

where XOR is the abbreviation of the EXCLUSIVE OR function. The gate is thus
the reversible form of the conventional (irreversible) XOR gate. The logic
expressions of the CONTROLLED CONTROLLED NOT are equivalent to

P = A
Q = B
R = (A AND B) XOR C .



Table 1. Feynman’s three basic truth tables: (a) NOT, (b) CONTROLLED NOT,
(c) CONTROLLED CONTROLLED NOT

(a)
A | P
0 | 1
1 | 0

(b)
A B | P Q
0 0 | 0 0
0 1 | 0 1
1 0 | 1 1
1 1 | 1 0

(c)
A B C | P Q R
0 0 0 | 0 0 0
0 0 1 | 0 0 1
0 1 0 | 0 1 0
0 1 1 | 0 1 1
1 0 0 | 1 0 0
1 0 1 | 1 0 1
1 1 0 | 1 1 1
1 1 1 | 1 1 0

The CONTROLLED CONTROLLED NOT is a universal primitive [6]. This means 
that any boolean function of any finite number of logic input variables can be 
implemented by combining a finite number of such building blocks. 
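As a minimal sketch (the function names are illustrative and not taken from the paper), the three gates can be written directly from their logic expressions. The sketch checks that each gate is its own inverse, and shows how a NAND is obtained from a CONTROLLED CONTROLLED NOT with the input C preset to 1, which is one way to see why that gate is a universal primitive.

```python
from itertools import product

def not_gate(a):
    """NOT: P = NOT A."""
    return (1 - a,)

def cnot(a, b):
    """CONTROLLED NOT: P = A, Q = A XOR B."""
    return (a, a ^ b)

def ccnot(a, b, c):
    """CONTROLLED CONTROLLED NOT: P = A, Q = B, R = (A AND B) XOR C."""
    return (a, b, (a & b) ^ c)

def is_self_inverse(gate, width):
    """True if applying the gate twice restores every possible input word."""
    return all(gate(*gate(*bits)) == bits
               for bits in product((0, 1), repeat=width))

def nand_from_ccnot(a, b):
    """With C preset to 1, output R equals NAND(A, B)."""
    return ccnot(a, b, 1)[2]

if __name__ == "__main__":
    for name, gate, width in (("NOT", not_gate, 1),
                              ("CNOT", cnot, 2),
                              ("CCNOT", ccnot, 3)):
        print(name, "self-inverse:", is_self_inverse(gate, width))
```

Because NAND is itself universal, any boolean function can in principle be assembled this way, at the cost of preset inputs and garbage outputs.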

In spite of the fact that the CONTROLLED CONTROLLED NOT is sufficient, we
will use all three Feynman blocks for synthesis. The NOT block is trivial,
as we make use of dual electronics. This means that any boolean variable A
is represented by two electric signals: A and Ā = NOT A. Therefore, a simple
metal cross-over is sufficient to realize the NOT function: P being connected
to Ā, while P̄ is connected to A. Function (1) leads to the implementation of
Figure 1a, whereas function (2) is realized as in Figure 1b. The latter circuit
is deduced from the former. In the four sides of the square of Figure 1b, the
single switches from Figure 1a are replaced either by a series or by a parallel
connection. This extrapolation is inspired by conventional (restoring) digital
electronics, where a similar extrapolation of the NOT gate (Figure 2a) leads to
the NOR gate (Figure 2b) and to the NAND gate (Figure 2c).





Fig. 1. Basic square circuits: (a) Q = A XOR B, (b) R = (A AND B) XOR C. Here a
switch is in the state indicated by the arrow if the logic variable next to it equals one



2 Electronic Implementation 

Within the framework of the European multiproject-wafer service Europractice,
silicon prototypes of some circuits have been fabricated in the Alcatel
Microelectronics n-well c-MOS 2.4 µm technology. The layout is designed with Cadence
DesignFrameWork II 4.3.4 full-custom software. The n-MOS transistors have
length L equal to 2.4 µm and width W equal to 2.4 µm, whereas the p-MOS
transistors have L = 2.4 µm and W = 7.2 µm. The p-MOS transistors are chosen
three times as wide as the n-MOS transistors in order to compensate for the fact
that holes are three times less mobile in silicon than electrons.

Fig. 2. Conventional c-MOS logic gates: (a) the NOT gate, (b) the NOR gate, (c) the
NAND gate



For the implementation of the on-off switch, we use the transmission gate, 
consisting of two MOS transistors in parallel, i.e. one n-MOS transistor and one 
p-MOS transistor. This leads to the following number of transistors: 

— the NOT gate: no transistors 

— the CONTROLLED NOT gate: 8 transistors 

— the CONTROLLED CONTROLLED NOT gate: 16 transistors. 

We remark that these are not the only building blocks that can be used to construct
reversible circuits. Indeed, in the past some simple circuits have been implemented
using another fundamental building block having hexagonal symmetry [7] [8] [9], which,
however, requires 24 transistors.

Figure 3 shows a prototype of the CONTROLLED NOT gate and the CONTROLLED 
CONTROLLED NOT gate. We stress that they have no power supply inputs. Thus 
there are neither Vdd nor Vss nor ground busbars. Note also the complete absence 
of clock lines. Thus all signals (voltages and currents) and all energy provided 
at the outputs originate from the inputs. As a consequence, an inevitable signal 
loss occurs at the outputs. However, measurements indicate that the loss in our 
chip is always smaller than 10 mV for an input signal of 2, 3 or 4 volts [8]. 

This circuit is an example of dual-line pass-transistor logic, as opposed to 
conventional restoring logic. In conventional c-MOS, output pins are fuelled from 
a Vdd and a Vss power line. See e.g. the conventional c-MOS gates in Figure 2. 

3 Application 

Higher levels of computation particularly need the implementation of the full
adder. This can e.g. be realized with the help of two half adders. The latter
circuit can easily be built from one CONTROLLED NOT block and one CONTROLLED
CONTROLLED NOT block, as shown in Figure 4a. The output A XOR B provides
the sum bit S, whereas the output A AND B provides the carry-out bit Co.

Fig. 3. Microscope photograph of the CONTROLLED NOT and the CONTROLLED
CONTROLLED NOT gate

It is well known that a full adder can be constructed from two half adders
and one OR gate. However, as one can expect, this is not the most economic
implementation. Figure 4b gives a far more efficient design [4] [5] [10]. Not only
do we have here four blocks instead of six, but (and this is even more important)
we have only four dual input lines. The circuit consists of 2 x 8 + 2 x 16 = 48
transistors.

Figure 5 shows a prototype 4-bit adder chip. It contains a total of 4 x 48 =
192 transistors. It sums a 4-bit number (A0, A1, A2, A3) and a 4-bit number
(B0, B1, B2, B3). The result is a 4-bit number (S0, S1, S2, S3). The first carry-in
bit, i.e. (Ci)0, is set to zero, whereas the carry-over bits ripple from one full adder
to the next, the last carry-out (Co)3 yielding the overflow bit. From Figure 4b,
we see that not only the sum

S = A + B (3) 

is calculated, but that another output recovers the value of input A : 

S = A + B 

T = A . (4) 

This is no surprise, as eqn (3) is not reversible (the value of S being insufficient 
for recovering both A and B), whereas result (4) is computationally reversible: 

A = T 

B = S-T . 



See Figure 6. 

Besides A, S and Co of Figure 4b, the full adder also provides a fourth output 
(i.e. A XOR B). This result is considered as ‘garbage’. The garbage outputs are 
the counterpart of the preset inputs. 
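The four-block full adder and the ripple-carry arrangement can be sketched as follows. The gate ordering used here is an assumption (Figure 4b is not reproduced in the text); it is one arrangement of two CONTROLLED NOT and two CONTROLLED CONTROLLED NOT blocks on four lines that yields the outputs named above: A, the garbage A XOR B, the sum S and the carry-out Co.

```python
def cnot(a, b):
    return a, a ^ b

def ccnot(a, b, c):
    return a, b, (a & b) ^ c

def full_adder(a, b, c_in):
    """Reversible full adder on four lines (A, B, Ci, preset 0).

    Returns (A, A XOR B, S, Co). The gate order is an assumed layout.
    """
    d = 0                           # preset input line
    a, b, d = ccnot(a, b, d)        # d = A AND B
    a, b = cnot(a, b)               # b = A XOR B
    b, c_in, d = ccnot(b, c_in, d)  # d = carry-out
    b, c_in = cnot(b, c_in)         # c_in = A XOR B XOR Ci = sum
    return a, b, c_in, d

def ripple_add4(x, y):
    """Add two 4-bit numbers with four full adders; returns (sum, overflow)."""
    carry, result = 0, 0            # first carry-in is set to zero
    for i in range(4):
        _, _, s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry

# Transistor counts quoted in the text: 8 per CNOT, 16 per CCNOT.
FULL_ADDER_TRANSISTORS = 2 * 8 + 2 * 16
FOUR_BIT_ADDER_TRANSISTORS = 4 * FULL_ADDER_TRANSISTORS

if __name__ == "__main__":
    print(ripple_add4(9, 8), FULL_ADDER_TRANSISTORS, FOUR_BIT_ADDER_TRANSISTORS)
```

The first output line carries A through unchanged, which is what makes the addition reversible: B can be recovered from the sum and A.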
Fig. 4. Block diagram of Feynman reversible adders: (a) the half adder and (b) the full
adder

Fig. 5. Microscope photograph of a prototype c-MOS reversible Feynman four-bit
adder

Fig. 6. Two different ways of adding two numbers: (a) irreversible way, (b) reversible
way



A very important advantage of the Feynman gates, with respect to the ‘hexag- 
onal’ gates, is the fact they fulfil two conditions simultaneously: 

— the backward truth tables are equal to the forward truth tables (or in other 
words: the gates are identical to their reverse gates) and 

— the electronic implementation is identical to its mirror image. 

As a result, circuits, 

— that are entirely built from Feynman gates and

— where these building blocks are interconnected in a symmetric way, 

can compute in both directions. The inputs and the outputs of such circuits 
are indistinguishable. There is no need for additional hardware to implement 
‘electronic reversibility’ [8]. The circuit can equally perform the same calculation 
from left to right as from right to left. 

An important class of such circuits is formed by the garbageless circuits 
proposed by Fredkin and Toffoli [3]. Indeed, the ‘undo’ subcircuit is the mirror 
image of the ‘do’ part, whereas the ‘spy’ circuit is its own mirror image. An 
example is shown in Figure 7: a garbageless one-bit full adder. 




Fig. 7. Microscope photograph of a prototype garbageless Feynman one-bit full adder

4 Adiabatic Addressing

The described reversible c-MOS circuits are particularly suited for adiabatic
addressing [8] [11] [12] [13] [14]. The dynamic behaviour of a 1-bit adder is
examined by a change of the input (A, B, Ci). If e.g. we change (A, B, Ci) from
(0,0,1) to (1,0,1), by raising V_A from −φ to +φ (see Figure 8a), then changes
happen in a non-adiabatic way, just like in conventional c-MOS. Energy
dissipation for charging an output capacitor C equals ½ C(2φ)². The problem can easily
be circumvented by introducing an intermediate state, where all three voltages
V_A, V_B, and V_Ci equal zero (and thus the logic variables A, B, and Ci are
undetermined). See Figure 8b. In the first part of the switching process all capacitive
loads are discharged, sending their stored energy to the voltage sources at the
input pins; in the second part of the switching process the input voltage sources
recharge the output capacitors.







Fig. 8. Input voltages as function of time: (a) conventional addressing, (b) adiabatic 
addressing 



Spice simulations confirm that the quasi-adiabatic charging indeed consumes
less energy than the conventional one. We call the energy consumed for one switch
and back the ‘energy dissipation per computational cycle’ ΔW. At high
speed, we find the limit value ½ C(2φ)². For sufficiently slow clocks, the adiabatic
switching is economic. However, ΔW decreases more slowly with addressing time
τ than the τ⁻¹ law predicted by the Athas equation [11]. This is caused by the
fact that a transistor is not ohmic [8]. Even more disturbing is the saturation
of ΔW, for very slow switching, at a value approximately equal to ½ C(Vt)²,
where Vt is the threshold voltage of the transistors. Indeed, transmission gates
are turned off as long as both transistors have a gate-source voltage below
the threshold voltage. Figure 9 illustrates this behaviour for τ = 50 µs and φ =
2 V (and Vt = 0.9 V). Per cycle of 100 µs, the 1-bit adder recovers 5.3 pJ, while
dissipating only 0.6 pJ of non-recoverable energy.

Fig. 9. Experimental transient signals on oscilloscope screen: input bit B and output
bit Co of a full adder

If we aim to decrease the energy dissipation further, we have to lower either 
the threshold voltages Vt of the transistors or their input capacitances. This can 
be performed in three ways: 

— either we lower Vt by applying an appropriate bias to the bulk, 

— or we go to next generation (i.e. submicron) silicon process, 

— or we choose a drastically different technology, e.g. silicon-on-insulator. 

5 Further Development 

We can easily generalize Feynman’s gates toward a broad class of reversible
logic gates, which we call control gates. Such a gate has w inputs (A_1, A_2, ..., A_w) and
w outputs (P_1, P_2, ..., P_w), satisfying

P_i = A_i for all i ∈ {1, 2, ..., m}

P_i = f_i(A_1, A_2, ..., A_m) XOR A_i for all i ∈ {m + 1, m + 2, ..., w} ,

with 1 ≤ m < w and with f_i arbitrary boolean functions. The m inputs (A_1,
A_2, ..., A_m) are called the controlling inputs; the w − m inputs (A_{m+1}, A_{m+2},
..., A_w) are called the controlled inputs. The number w is the width of the gate.

The implementation of a control gate is straightforward: the m control lines 
are mere electric wires from input to output, whereas the remaining w — m 
outputs Pi are generated from ‘squares’ like in Figure 1, the corresponding inputs 
Ai being preset to logic 0. A submicron d-bit carry-look-ahead adder, entirely 
based on this principle, is in preparation. 
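A minimal sketch of such a control gate is given below, assuming the definition that the first m lines pass through unchanged and each remaining line is XOR-ed with a boolean function of the m controlling inputs. The helper name `make_control_gate` is hypothetical. The sketch checks that any such gate is its own inverse and that presetting a controlled input to 0 makes the corresponding output equal to the chosen function.

```python
from itertools import product

def make_control_gate(m, functions):
    """Build a control gate of width w = m + len(functions).

    functions[j] is an arbitrary boolean function of the m controlling bits;
    it is XOR-ed onto controlled line m + j.
    """
    width = m + len(functions)

    def gate(*bits):
        if len(bits) != width:
            raise ValueError(f"expected {width} bits, got {len(bits)}")
        controls, controlled = bits[:m], bits[m:]
        outputs = tuple(f(*controls) ^ a for f, a in zip(functions, controlled))
        return controls + outputs

    return gate

def is_self_inverse(gate, width):
    return all(gate(*gate(*bits)) == bits
               for bits in product((0, 1), repeat=width))

# Feynman's gates as special cases.
cnot = make_control_gate(1, [lambda a: a])
ccnot = make_control_gate(2, [lambda a, b: a & b])

# A wider example: three controlling inputs, two controlled inputs.
majority = lambda a, b, c: (a & b) | (a & c) | (b & c)
parity = lambda a, b, c: a ^ b ^ c
wide_gate = make_control_gate(3, [majority, parity])

if __name__ == "__main__":
    print(wide_gate(1, 1, 0, 0, 0))   # controlled inputs preset to 0
    print(is_self_inverse(wide_gate, 5))
```

With the controlled inputs preset to 0, the wide example computes carry (majority) and sum (parity) of three bits in a single gate of width 5.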
6 Conclusion

We have demonstrated a way to build up reversible boolean functions by means
of the fundamental building blocks proposed by Feynman. The electronic
implementation of these logic gates is based on dual-line pass-transistor logic.
Applying reversible logic introduces an overhead of circuitry, whereas the threshold
behaviour of MOS transistors prevents us from approaching the adiabatic limit. Therefore
our architecture is not competitive, but it is particularly useful for studying the
fundamentals of digital computation. We have applied our design methodology to
some basic circuits. In particular, a four-bit adder has been demonstrated.
Abstract. In this communication adiabatic and conventional gates with a
different fan-in are modeled and analytically compared. The comparison is
carried out assuming both an assigned power supply and setting it to minimize
power consumption. The analysis leads to simple expressions, which make it possible
to understand how the power advantage of adiabatic logic changes by increasing
the fan-in of the implemented gate. The analytical results were validated by
means of Spice simulations using a 0.8 µm CMOS technology.



1 Introduction 

Power consumption reduction has become a key design aspect in ICs [1], because of
the wide diffusion of portable equipment. Using the conventional CMOS design style,
the most effective way to reduce power dissipation is to lower the supply voltage [11].

Recently, the Adiabatic Switching approach to reduce power dissipation in digital
circuits was proposed [2]. It has been used and verified in many digital applications
[5], [6], [7], [8]. A time-varying clocked ac power is used to slowly charge the node
capacitances, and then partially recover the energy associated to that charge by slowly
decreasing the supply [2], [3], [4].

Even if the interest in adiabatic logic design style and architectures is growing,
comparisons between adiabatic and conventional styles are analytically carried out
only in the simple case of an inverter [2], [3], [4], [9].

In this communication, we analytically evaluate the power reduction of adiabatic
logic with respect to conventional logic considering gates with a different fan-in. More
specifically, we analyze the inverter, NAND2, NAND3 and NAND4 gates. The
resulting expressions are simple, hence they make it possible to understand how the
advantage of the adiabatic style changes by increasing the fan-in of the gate and for
different load capacitances.

The comparison is carried out both assuming an assigned supply, as in the case of
logic levels compatibility requirement, and setting it to minimize power consumption
for a given speed requirement.

The validity of the used model is tested by Spice simulations on NAND2, NAND3
and NAND4 gates designed with a 0.8 µm technology.
2 Adiabatic Gates Advantage over Conventional



2.1 Inverter 



The adiabatic inverter is shown in Fig. 1 [2]. It is implemented by using two
transmission gates and a power clock, V_φ, whose maximum amplitude is equal to V_DD
and rise time equal to T. It has differential inputs and outputs, and is loaded by an
external capacitance C_L.







Fig. 1. Adiabatic inverter. 

To evaluate the energy consumption, we approximate the transmission gate in the 
ON and OFF state to the linear circuits in Figs. 2a and 2b, respectively. 







Fig. 2. Linear equivalent circuit of (a) transmission gate in the ON state,
(b) transmission gate in the OFF state.

Without loss of generality, we assume a linear ramp clocked power supply
waveform. However, all results can simply be extended to a general power clock
waveform by multiplying them by a proper shape factor, ξ, which only depends on
the clock waveform [2-4]. Moreover, in the following the transmission gate
parameters will be evaluated by assuming symmetrically sized transistors (i.e.,
(W/L)_p = 2(W/L)_n) and minimum sized NMOS devices. Analysis of the resulting RC
circuit, considering both the energy wasted during charge and recovery, leads to the
following expression of the adiabatic energy, E_ad,NOT, wasted in a cycle (i.e., for a
single computation)

$E_{ad,NOT} = 2\,\frac{R}{T}\,(C + C_L)^2\,V_{DD}^2$   (1)

where the equivalent resistance of the transmission gate is
R = [µ_n C_ox (W/L)_n (V_DD − 2V_T)]^{-1}.

The average energy wasted by a symmetrically designed conventional inverter
(i.e., (W/L)_p = 2(W/L)_n) with minimum sized NMOS transistor, E_conv,NOT, is roughly
equal to 0.5·(1/4)·(C + C_L)·V_DD², obtained by approximating its intrinsic capacitance
to that of a transmission gate¹, and assuming input values statistically independent
from the others with an equal probability to be zero or one (hence, the resulting
switching activity is 1/4). Hence, to compare the power dissipation of adiabatic and
conventional logic for an assigned supply, let us define the parameter F_NOT as

$F_{NOT} = \frac{E_{ad,NOT}}{E_{conv,NOT}} = 16\,\frac{1+\alpha}{T_n}\left(\frac{V_{DD}}{V_T}-2\right)^{-1}$   (2)

where α = C_L/C is the load capacitance normalized to the parasitic capacitance of the
transmission gate, and T_n = T·V_T·µ_n C_ox (W/L)_n / C is the normalized rise time of the
adiabatic power clock, V_φ.

To minimize power consumption for a given speed requirement, the supply of the
conventional inverter is set to the minimum value, V_DD,op,conv, which satisfies its
propagation delay constraint [9]



$V_{DD,op,conv} = V_T\left[1 + \sqrt{\frac{C + C_L}{\mu_n C_{ox}(W/L)_n\,V_T\,\tau_{PD}}}\right]$   (3)

where we assumed 4µ_n C_ox (W/L)_n V_T τ_PD/(C + C_L) ≫ 1. To minimize adiabatic energy
consumption, we have to set V_DD simply equal to 4V_T [2-4]. Let us define E_conv,op,NOT
and E_ad,op,NOT as the energy wasted by the conventional and adiabatic inverter obtained
by setting an optimized supply voltage given by eq. (3) and equal to 4V_T, respectively.
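The approximation behind the optimum conventional supply can be checked numerically. The sketch below assumes a delay model of the form τ_PD = (C + C_L)·V_DD / [β·(V_DD − V_T)²] with β = µ_n C_ox (W/L)_n (an assumption about the model of [9]), solves it exactly for V_DD, and compares the result with the closed form V_T·(1 + √x), where x = (C + C_L)/(β·V_T·τ_PD) is a dimensionless parameter introduced here for convenience.

```python
import math

def vdd_exact(v_t, x):
    """Exact root (V_DD > V_T) of (V - V_T)^2 = x * V_T * V.

    x = (C + C_L) / (beta * V_T * tau_PD) is dimensionless (assumed model).
    """
    b = (2.0 + x) * v_t
    return (b + math.sqrt(b * b - 4.0 * v_t * v_t)) / 2.0

def vdd_approx(v_t, x):
    """Closed form valid when 4 / x >> 1."""
    return v_t * (1.0 + math.sqrt(x))

if __name__ == "__main__":
    v_t = 0.74  # threshold voltage quoted later in the paper, in volts
    for x in (1e-3, 1e-2, 1e-1):
        exact, approx = vdd_exact(v_t, x), vdd_approx(v_t, x)
        print(f"x={x:g}  exact={exact:.4f} V  approx={approx:.4f} V  "
              f"error={(exact - approx) / exact:.2%}")
```

The neglected term is V_T·x/2 to first order, so the closed form slightly underestimates the required supply and the error shrinks as the delay requirement is relaxed.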

Let us consider the case of supply optimized to meet a speed requirement with
minimum power dissipation. To compare the energy dissipation at a defined speed,
we set the transition period, T, of the adiabatic inverter equal to the propagation delay
of the conventional gate (i.e., T = τ_PD). To carry out the comparison in this optimized
case, let us define the parameter F_op,NOT as

$F_{op,NOT} = \frac{E_{ad,op,NOT}}{E_{conv,op,NOT}} = 128\,\frac{1+\alpha}{T_n\left(1+\sqrt{\frac{1+\alpha}{T_n}}\right)^{2}}$   (4)



2.2 NAND Gates 

In Fig. 3 the topology of an n-inputs adiabatic NAND gate is shown. The analysis of 
the adiabatic NAND energy consumption is not as simple as that of the inverter, since 
its equivalent linearized circuit, obtained substituting each transmission gate with the 
model in Figs. 2a and 2b, is an RC ladder network if all the input values are high. 



¹ This leads to a slight overestimate of the capacitance associated with the transistors
in the cut-off region, but if C_L is comparable or greater the error is not significant.










Fig. 3. Adiabatic n-inputs NAND. 



Recently, in [10] a simple and general approach to accurately analyze the dissipation
of adiabatic gates was proposed. It was demonstrated that the energy wasted by the
generic resistor R_ij between nodes i and j in the equivalent circuit of an adiabatic gate
is given by

$E_{R_{ij}} = \frac{V_{DD}^2}{T}\,\frac{(T_{Di}-T_{Dj})^2}{R_{ij}}$   (5)

where T_Di and T_Dj are the time constants associated to the nodes i and j. The time
constant T_Di at a given node i is given by

$T_{Di} = \sum_{k=1}^{\text{no. of nodes}} r_{ik}\,C_k$   (5a)

where C_k and r_ik are the capacitance at node k and the DC transresistance gain between
nodes k and i, respectively [12]. This model can be used to evaluate the energy wasted
by the adiabatic NAND2 for each input value. In Figs. 4a, 4b, 4c and 4d the
equivalent circuit of the adiabatic NAND2 is shown assuming an input equal to (0,0),
(0,1), (1,0) and (1,1), respectively.
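As a rough illustration of this model, the sketch below evaluates the energy wasted in an RC ladder driven by a linear ramp. It assumes that the energy in the resistor between nodes i and j during one ramp is V_DD²·(T_Di − T_Dj)²/(R_ij·T), and that for a ladder the transresistance between two nodes is the sum of the resistors shared by their paths to the source. Both the function names and the ladder restriction are choices made for this example, not taken from [10].

```python
def ladder_time_constants(resistors, capacitors):
    """Time constants T_Di for a ladder: source -R[0]- node 0 -R[1]- node 1 ...

    capacitors[k] is the capacitance to ground at node k. The transresistance
    between nodes i and k is the resistance shared by their paths to the source.
    """
    n = len(capacitors)
    shared = [sum(resistors[: i + 1]) for i in range(n)]
    return [sum(shared[min(i, k)] * capacitors[k] for k in range(n))
            for i in range(n)]

def ladder_ramp_energy(resistors, capacitors, v_dd, t_ramp):
    """Energy wasted in all ladder resistors during one ramp of duration t_ramp."""
    taus = [0.0] + ladder_time_constants(resistors, capacitors)  # source has T_D = 0
    return sum(v_dd ** 2 / (r * t_ramp) * (taus[j + 1] - taus[j]) ** 2
               for j, r in enumerate(resistors))

if __name__ == "__main__":
    r, c, v, t = 10e3, 12.1e-15, 3.3, 20e-9   # illustrative values only
    print(ladder_ramp_energy([r], [c], v, t))          # single RC: R C^2 V^2 / T
    print(ladder_ramp_energy([r, r], [c, 2 * c], v, t))
```

For a single RC stage this reduces to the familiar R·C²·V_DD²/T; for a two-stage ladder the first resistor carries the charge of both capacitors, which is why series-connected transmission gates are penalized.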




(a) (b) (c) (d) 

Fig. 4. Equivalent circuit of an adiabatic NAND2 for input (70,71) 
equal to a) (0,0), b) (0,1), c) (1,0), d) (1,1). 



For each of these networks, applying eq. (5) to each resistor, summing the
contributions of the resistors and multiplying the results by two to take into account
the charge and recovery phases, we get

$E_{0,0} = \frac{V_{DD}^2}{T}\,R\,(2C + C_L)^2$   (6a)

$E_{0,1} = 2\,\frac{V_{DD}^2}{T}\left[R\,(2C + C_L)^2 + R\,(2C)^2\right]$   (6b)

$E_{1,0} = 2\,\frac{V_{DD}^2}{T}\,R\,(2C + C_L)^2$   (6c)

$E_{1,1} = 2\,\frac{V_{DD}^2}{T}\left[R\,(2C + C_L)^2 + R\,(C + C_L)^2\right]$   (6d)



The average energy wasted by the adiabatic NAND2, E_ad,NAND2, is equal to the
weighted sum of each term with the probability of the corresponding input.

In the following, we will assume that each input of an n-inputs gate has equal
probability to be zero or one, and is statistically independent from the others, hence
each input value has a probability to occur equal to 1/2^n. For the NAND2, this leads to

$E_{ad,NAND2} = \sum_{l,m} E_{l,m}\,P_{l,m} = \frac{1}{4}\sum_{l,m} E_{l,m}$   (7)

which means that E_ad,NAND2 is equal to 1/4 times the sum of the energy contributions
associated to all of the possible input values. Hence, for the NAND2 we get

$E_{ad,NAND2} \approx 2.25\,\frac{V_{DD}^2}{T}\,R\,(2C + C_L)^2$   (8)

in which we assumed C_L comparable or greater than C, and hence we neglected the
terms which do not depend on C_L and in the others we increased the coefficient of C
to 2 (i.e., the fan-in). The approximation leads to an error below 20% if C_L > 0.1C,
which always holds for realistic load capacitances.

In Fig. 5 the topology of an n-inputs conventional NAND is shown. 









Fig. 5. Conventional n-inputs NAND gate. 



The intrinsic capacitance is the sum of the capacitances of n complementary pairs
of transistors. To simplify the analysis, let us approximate the capacitance of each
complementary pair to that of a transmission gate, C. This only leads to a slight
overestimation if C_L is comparable or greater than C. Hence, the average energy
wasted by the conventional n-inputs NAND for a single computation is equal to

$E_{conv,NANDn} = \frac{1}{2}\,\frac{2^n - 1}{2^{2n}}\,(nC + C_L)\,V_{DD}^2$   (9)

where switching activity was evaluated assuming each input has equal probability to
be zero or one and is statistically independent from the others.
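The switching activity under these input statistics can be checked by enumeration. The sketch below assumes that the activity is the probability of a 0-to-1 output transition between two independent computations, P(out = 0)·P(out = 1); this interpretation is consistent with the value 1/4 quoted for the inverter, and the function name is illustrative.

```python
from fractions import Fraction
from itertools import product

def nand_switching_activity(n):
    """Probability of a 0 -> 1 output transition for an n-input NAND gate.

    Inputs are independent and equally likely to be 0 or 1; n = 1 is the inverter.
    """
    outputs = [0 if all(bits) else 1 for bits in product((0, 1), repeat=n)]
    p_one = Fraction(sum(outputs), len(outputs))
    return (1 - p_one) * p_one

if __name__ == "__main__":
    for n in range(1, 5):
        activity = nand_switching_activity(n)
        print(n, activity, float(activity))
```

In closed form this is (2^n − 1)/2^(2n), so the activity of a conventional NAND falls quickly with fan-in, which is part of the reason the adiabatic advantage shrinks for wider gates.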

To compare the power dissipation of the adiabatic and conventional NAND2 for a
given load capacitance, C_L, and switching frequency, assuming an assigned supply
voltage, let us define the parameter

$F_{NAND2} = \frac{E_{ad,NAND2}}{E_{conv,NAND2}} = 24\,\frac{2+\alpha}{T_n}\left(\frac{V_{DD}}{V_T}-2\right)^{-1}$   (10)

obtained from eqs. (8) and (9) with n=2.

In the same way, let us introduce the parameters F_NAND3 = E_ad,NAND3/E_conv,NAND3 and
F_NAND4 = E_ad,NAND4/E_conv,NAND4 to compare the NAND3 and NAND4 for an assigned supply
voltage. The expressions of E_ad,NAND3 and E_ad,NAND4 are easily derived following the
same procedure used to obtain eq. (8):

$E_{ad,NAND3} \approx 2.0\,\frac{V_{DD}^2}{T}\,R\,(3C + C_L)^2$   (11)

$E_{ad,NAND4} \approx 1.6\,\frac{V_{DD}^2}{T}\,R\,(4C + C_L)^2$   (12)

in which we introduced the same approximations as in (8), which lead to an error
below 20% if C_L > 0.3C and C_L > 0.42C, respectively. The resulting expressions of F are

$F_{NAND3} = 37\,\frac{3+\alpha}{T_n}\left(\frac{V_{DD}}{V_T}-2\right)^{-1}$   (13)

$F_{NAND4} = 55\,\frac{4+\alpha}{T_n}\left(\frac{V_{DD}}{V_T}-2\right)^{-1}$   (14)

From eqs. (10), (13) and (14), it is apparent that the parameter F of adiabatic
NAND gates is proportional to 1/T, as in the inverter case.

As done for the simple inverter, let us consider the comparison of the adiabatic and
the conventional NAND2 assuming the supply optimized for minimum power
consumption and a given speed constraint. The optimized adiabatic NAND2 energy
consumption, E_ad,op,NAND2, is found by setting V_DD = 4V_T in eq. (8). For the conventional
NAND2 gate, as in the case of the inverter, we set the supply to the minimum value
which allows the propagation delay requirement, τ_PD, to be met. The delay of a
conventional NANDn gate can be evaluated by substituting an equivalent NMOS
device for the n series-connected minimum NMOS transistors of the pull-down
network; its aspect ratio, (W/L)_eq, is given by that of a single NMOS, (W/L)_n, divided
by the number of series transistors [11]. Hence, the optimized value of the supply for
a conventional NANDn, V_DD,op,conv, is simply obtained by substituting the aspect ratio of
the equivalent NMOS for that of a minimum one in eq. (3) (i.e., (W/L)_eq = (W/L)_n/n):

$V_{DD,op,conv} = V_T\left[1 + \sqrt{\frac{n\,(nC + C_L)}{\mu_n C_{ox}(W/L)_n\,V_T\,\tau_{PD}}}\right]$   (15)

Setting T = τ_PD, the comparison parameter for the NAND2 is

$F_{op,NAND2} = \frac{E_{ad,op,NAND2}}{E_{conv,op,NAND2}} = 192\,\frac{2+\alpha}{T_n\left(1+\sqrt{\frac{2(2+\alpha)}{T_n}}\right)^{2}}$   (16)

where eq. (15) was used setting n=2. Analogously, for the NAND3 and NAND4 we
get

$F_{op,NAND3} = \frac{E_{ad,op,NAND3}}{E_{conv,op,NAND3}} = 293\,\frac{3+\alpha}{T_n\left(1+\sqrt{\frac{3(3+\alpha)}{T_n}}\right)^{2}}$   (17)

$F_{op,NAND4} = \frac{E_{ad,op,NAND4}}{E_{conv,op,NAND4}} = 437\,\frac{4+\alpha}{T_n\left(1+\sqrt{\frac{4(4+\alpha)}{T_n}}\right)^{2}}$   (18)



It is worth noting that the expressions comparing adiabatic and conventional logic,
both for an assigned supply (eqs. 2, 10, 13 and 14) and for an optimized supply (eqs.
4, 16, 17 and 18), depend only on the normalized time, T_n, the normalized load
capacitance, α, and on V_DD/V_T only in the first case.



3 Simulation Results 

To test the validity of the proposed expressions, Spice simulations of adiabatic and
conventional NAND2, NAND3 and NAND4 gates (for the inverter the validity was
confirmed in [9]) were performed under different conditions, using a 0.8 µm
CMOS technology. In particular, values of F were evaluated by varying the transition
period, T, in a range starting from a value which made F lower than but close to unity.
Applied inputs were statistically independent, with an equal probability to be zero
or one. For each gate, we assumed a load capacitance, C_L, equal to 20 fF and 200 fF,
corresponding to a fan-out of about two and twenty. The simulation runs were
performed both in the case of an assigned supply equal to V_DD = 3.3 V and in the case
of optimized supply. For the technology used, C = 12.1 fF and V_T = 0.74 V; therefore the
optimum supply for adiabatic gates is 4V_T = 2.96 V, and T_n = 9.85·T (with T in ns).

As an example, we report in Figs. 6 and 7 the simulated and predicted values of F
for a NAND4 with a load capacitance C_L = 20 fF, assuming an assigned and optimized
supply, respectively.
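The predicted values for the assigned-supply case can be reproduced with a few lines. This sketch assumes that F for the NAND4 has the form k(α)/[T_n·(V_DD/V_T − 2)] with k(α) = 55·(4 + α), and uses the technology values quoted above (C = 12.1 fF, V_T = 0.74 V, T_n = 9.85·T with T in ns). The function name is illustrative and the printed numbers are model predictions, not simulation data.

```python
C_FF = 12.1        # transmission-gate capacitance, fF
V_T = 0.74         # threshold voltage, V
TN_PER_NS = 9.85   # normalized time per nanosecond of transition period

def f_assigned_nand4(t_ns, c_load_ff, v_dd):
    """Predicted F for the NAND4 with an assigned supply (assumed closed form)."""
    alpha = c_load_ff / C_FF              # normalized load capacitance
    t_n = TN_PER_NS * t_ns                # normalized transition period
    k = 55.0 * (4.0 + alpha)              # fan-in dependent coefficient
    return k / (t_n * (v_dd / V_T - 2.0))

if __name__ == "__main__":
    for t_ns in (20, 100, 380):
        print(t_ns, "ns ->", round(f_assigned_nand4(t_ns, 20.0, 3.3), 3))
```

At the shortest transition period of the sweep (20 ns) the prediction is below but of the order of unity, and it then falls as 1/T, which matches the way the simulated range is described.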



[Plot: parameter F for the NAND4 with C_L = 20 fF versus transition period T from 20 ns
to 380 ns (T_n from 197 to 3741); SPICE results compared with predicted values.]

Fig. 6. Simulated and predicted (eq. 14) values of F_NAND4 versus transition period for
C_L = 20 fF and V_DD = 3.3 V.
[Plot: parameter F_op for the NAND4 with C_L = 20 fF versus transition period T from
250 ns to 4750 ns (T_n from 2461 to 46764); SPICE results compared with predicted values.]

Fig. 7. Simulated and predicted (eq. 18) values of F_op,NAND4 versus transition period for
C_L = 20 fF and optimized supply.



The curves obtained in the other cases are similar. Considering all the cases, the
error found is always lower than 30%, and in the case of assigned supply it tends to be
lower than in the optimized case. Hence, the derived expressions are suitable for
comparison of adiabatic and conventional gates.



4 Comparison between Adiabatic and Conventional Gates 

4.1 Power Dissipation 

The parameter F defined for all the considered gates in Section 2 is useful to compare
the performance of adiabatic and conventional gates for different fan-ins, loads and
supplies.

Note that for an assigned supply the adiabatic advantage linearly decreases with
load capacitance. This property approximately holds even for an optimized supply. In
fact, √T_n ≫ √(n(n + α)) for practical values of T such that F < 1. Hence, in both cases
F is inversely proportional to the transition period, T, and increases linearly with
the load capacitance.

Let us analyze how F changes varying the fan-in of the gate. Parameter F can be
written as a product of two functions, one equal to 1/[T_n(V_DD/V_T − 2)] for assigned supply
and roughly equal to 1/T_n for the optimized case, and one depending only on
parameter α, k(α). In Table 1, function k(α) is shown for the considered gates both for
assigned and optimized supply.



Table 1. Function k(α) for the NOT, NAND2, NAND3 and NAND4 gates for
assigned and optimized supply.

                 k_NOT       k_NAND2     k_NAND3     k_NAND4
Assigned V_DD    16(1+α)     24(2+α)     37(3+α)     55(4+α)
Optimized V_DD   128(1+α)    192(2+α)    293(3+α)    437(4+α)
For low values of C_L (α → 0), from inspection of Table 1, it is evident that the
advantage of adiabatic logic decreases with increasing fan-in of the gate, and
increasing the fan-in by one leads to roughly a doubling or more of F,
both for assigned and optimized supply. The same observations hold if
C_L is comparable with C (α = 1). If C_L is much greater than the parasitic capacitance
(α → ∞), the parameter F still increases by increasing the fan-in, but at a slower rate.
In fact, increasing fan-in by one leads to an increase of F by about 50%. This holds
both for assigned and optimized supply.

Summarizing the results, the adiabatic performance gets worse with increasing
fan-in of the gate irrespective of the load capacitance value. Moreover, the rate of increase
of F due to an increase of the fan-in by one decreases from 100% to 50%, considering zero
to high load capacitance, respectively.
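The fan-in trend can be read off numerically from the assigned-supply coefficients 16(1+α), 24(2+α), 37(3+α) and 55(4+α). The short sketch below computes the factor by which F grows when the fan-in is increased by one, for a small, a comparable and a very large normalized load α. The helper names are illustrative.

```python
# Coefficients of k(alpha) = K * (n + alpha) for the assigned-supply case;
# fan-in 1 is the inverter.
K_ASSIGNED = {1: 16, 2: 24, 3: 37, 4: 55}

def k_assigned(fan_in, alpha):
    return K_ASSIGNED[fan_in] * (fan_in + alpha)

def growth(fan_in, alpha):
    """Factor by which F grows when going from fan_in to fan_in + 1."""
    return k_assigned(fan_in + 1, alpha) / k_assigned(fan_in, alpha)

if __name__ == "__main__":
    for alpha in (0.0, 1.0, 1e6):
        print(alpha, [round(growth(n, alpha), 2) for n in (1, 2, 3)])
```

For α → 0 the growth factors are about 3, 2.3 and 2, while for very large α they settle near 1.5, in line with the 50% figure for heavily loaded gates.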



4.2 Power-Delay Product 



It is of interest to evaluate the power-delay product, PDP, of adiabatic and
conventional gates, since it is an important figure of merit for digital circuits. It
measures the efficiency of a design style in the trade-off between power and speed.

For a given supply voltage, load capacitance and switching frequency, the ratio
between adiabatic and conventional power-delay product for a generic gate is

$\frac{PDP_{ad}}{PDP_{conv}} = \frac{P_{ad}\,T}{P_{conv}\,\tau_{PD}} = F\,\frac{T}{\tau_{PD}}$   (19)

where the conventional NAND gate delay is equal to
τ_PD = n(nC + C_L)V_DD/[µ_n C_ox (W/L)_n (V_DD − V_T)²].

After some simple calculations, from (19) it can be seen that, for all the considered
gates, PDP_ad/PDP_conv is equal to a coefficient k_PDP multiplied by
(V_DD − V_T)²/[V_DD(V_DD − 2V_T)], which is only slightly greater than unity for practical
values of V_DD.


Table 2. Coefficient k_PDP for the NOT, NAND2, NAND3 and NAND4 gates.

k_NOT    k_NAND2    k_NAND3    k_NAND4
16       12         12.2       13.6



The coefficient k_PDP for the different gates is reported in Table 2, which shows that
the adiabatic logic PDP is always worse than the conventional one for every fan-in
value. With increasing fan-in, except for the simple inverter case, the disadvantage of
adiabatic gates increases.

In the comparison with an optimized supply at a given speed (T = τ_PD),
PDP_ad/PDP_conv is simply equal to F_op, analyzed above. Hence, in cases in which it
makes sense to use adiabatic logic (F_op < 1), the adiabatic gates are always more
efficient than conventional ones.
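For the assigned-supply case, the size of the PDP ratio can be estimated with the sketch below. It assumes that the ratio factors into k_PDP, taken as the assigned-supply coefficient divided by the fan-in, times a voltage-dependent term (V_DD − V_T)²/[V_DD(V_DD − 2V_T)]. This factorization follows from combining F·T/τ_PD with a delay proportional to n(nC + C_L)V_DD/(V_DD − V_T)², and should be read as a reconstruction rather than the authors' exact expression.

```python
def voltage_factor(v_dd, v_t):
    """Voltage-dependent term of the PDP ratio (assumed form); needs v_dd > 2 * v_t."""
    return (v_dd - v_t) ** 2 / (v_dd * (v_dd - 2.0 * v_t))

def pdp_ratio(k_coeff, fan_in, v_dd, v_t):
    """Adiabatic-to-conventional PDP ratio for an assigned supply."""
    k_pdp = k_coeff / fan_in
    return k_pdp * voltage_factor(v_dd, v_t)

if __name__ == "__main__":
    v_dd, v_t = 3.3, 0.74   # supply and threshold quoted for the simulations
    print("voltage factor:", round(voltage_factor(v_dd, v_t), 3))
    for name, k, n in (("NOT", 16, 1), ("NAND2", 24, 2),
                       ("NAND3", 37, 3), ("NAND4", 55, 4)):
        print(name, round(pdp_ratio(k, n, v_dd, v_t), 1))
```

With V_DD = 3.3 V and V_T = 0.74 V the voltage term is within about 10% of unity, so the ratio is dominated by k_PDP and stays well above 1 for every gate considered.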
5 Conclusions

In this communication adiabatic gates were analytically compared to conventional
ones for a different fan-in, load capacitance and assuming both an assigned supply
and an optimized supply to minimize power consumption for a given speed constraint.
Simple expressions were obtained. It was found that the power advantage of adiabatic
logic linearly decreases with load capacitance and proportionally increases with the
transition period of the power clock. Moreover, this advantage decreases by
increasing gate fan-in for every value of load capacitance, and hence the advantage of
NAND gates is always lower than that predicted for the inverter. For high values of
C_L the advantage worsens more slowly as the fan-in increases.

Finally, the power-delay products of adiabatic and conventional logic were compared,
and it was demonstrated that for a given supply adiabatic gates are always worse than
conventional ones, while for an optimized supply adiabatic gates are always more
efficient than conventional ones.
Abstract. Adiabatic switching might be a way to overcome the
power losses in CMOS caused by the charging of capacitances. The design of
adiabatic gates and registers has been examined in the past. The
possibilities offered to the design of logic are evaluated in this paper.

For this purpose an array multiplier has been chosen as a representative
of more complex structures. To allow a comparison,
it has been realized as an adiabatic circuit as well as in a standard
CMOS design. In this article special attention is paid to the
placement of the registers in the adiabatic circuit. This was done by using a
modified retiming algorithm.

Both designs were simulated using SPICE. Although the simulation
results show a significant reduction of power, they have to be interpreted
with caution. Based on them it is discussed whether the reduction of
dissipated energy can compensate for the required overhead.



1 Introduction 

To show the possibilities of adiabatic circuits it is not sufficient to evaluate the 
concept on single gates. Due to the fact that the registers consume a significant 
part of the area and cause a major part of the power loss their number is very 
important. Since registers are mandatory for the function of adiabatic circuits 
of the proposed type the register count is strongly affected by the architecture. 
An array multiplier has been realized as CMOS and as an adiabatic circuit. To 
maintain acceptable simulation times, a word size of 3 bit has been chosen. This 
gives the possibility to simulate a wide range of different parameters like the 
ramping time T and the transistor size. 

2 Structure of the Multiplier 

We decided to use an array multiplier because of its regular structure [1]. This
makes the placement of the registers, as described in section 4, more effective than
for an asymmetric architecture like the Wallace tree multiplier. The disadvantage
of having a longer delay than other possible architectures is not important in
adiabatic circuits, because their speed does not depend on the logic depth but on
the number of pipeline steps needed.

Fig. 1. 3x3 bit array multiplier



3 Gate and Register Design 

Only three different gates are needed to realize the proposed multiplier: The 
AND gate to compute the partial products as well as full and half adders to sum 
them. The AND gate is used as an example to explain their design. The adders 
are designed accordingly. 




Fig. 2. Design process of an AND gate 



The logic function is first represented by a binary decision diagram (BDD), 
which could be directly realized as a transistor schematic, but contains signifi- 
cantly more transistors than necessary. The BDD can be simplified by choosing 
an optimal order for the inputs and by including through signals in the root 
nodes. In addition only the n-channel transistor is required if the corresponding 
through signal is directly connected to ground. The whole design process is ex- 
plained in detail in [4]. 
Note that the input signals have to be divided into two classes. Signals which
are connected to the gates of the transmission gates are called
control signals. Signals of the other class are called through signals and are connected to
the source contacts of the transmission gates. The control signals must have a
constant value while the through signals are active. Note that all outputs of logic
gates are always through signals.




Fig. 3. Signal type conflicts at a full adder 



The full adder shown in figure 3 receives two of its three input signals from 
other adders. As they are the outputs of a logic gate, they are through signals. 
The full adder itself can handle only one input signal as a through signal. So, 
there is a signal type conflict which can not be solved in the same clock cycle 
because a control signal can never be generated from a through signal. The only 
way to overcome this problem is to delay the control signal by one clock cycle 
and use a register to convert the through signal into a control signal. 



Fig. 4. Register (blocks: INLATCH, OUTLATCH)



The registers have been designed according to [5]. In general, a register
consists of two latches, each of which has a delay of half a clock cycle. Most of the
registers have to accept a through signal at the input and have to provide a
control signal at the output. The latches have been designed accordingly. The
inlatch could be realized reversibly, as the information about the state of the
intermediate signal Z is still available in the output O. This was not possible in
the outlatch because the status of the output O is not stored. In order to avoid
nonadiabatic charging, diodes were included in the outlatch.

The need for registers and the need for the inverted signals lead to an overhead
of transistors.





                          CMOS    adiabatic
AND gate                     6            7
half adder                  14           15
full adder                  28           30
register                     8           40
signal type conversion       0            6
3x3 multiplier             180          910



As shown in the table, the increased number of transistors results mainly from
the additional registers. There are none at all in the CMOS design, whereas the
adiabatic circuit requires 17 of them. This increases the area consumption by a
factor of about 5.
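The totals in the table can be cross-checked with a few lines of arithmetic. The sketch below is only illustrative: the per-cell counts and the two totals are taken from the table, while the cell composition of the 3x3 array multiplier (nine AND gates, three half adders, three full adders) is an assumption that the text does not state explicitly.

```python
# Per-cell transistor counts of the CMOS design, taken from the table above.
CMOS_CELLS = {"and": 6, "half_adder": 14, "full_adder": 28}

# ASSUMPTION: cell composition of a 3x3 array multiplier (not given in the text).
CELL_COUNT = {"and": 9, "half_adder": 3, "full_adder": 3}


def total_transistors(per_cell, cell_count):
    """Sum the transistors over all cells of each type."""
    return sum(per_cell[name] * count for name, count in cell_count.items())


cmos_total = total_transistors(CMOS_CELLS, CELL_COUNT)
adiabatic_total = 910  # total reported in the table
overhead = adiabatic_total / cmos_total

print(cmos_total)          # 180, matching the table
print(round(overhead, 2))  # 5.06
```

Under this assumed composition the CMOS total reproduces the tabulated 180 transistors, and the ratio of the two totals is consistent with the quoted area factor of about 5.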

4 Register Distribution 

With a register after each logic gate it is already possible to build any logic 
function. The register can provide any required signal type for the following 
gates, but the resulting structure is fully pipelined. Of course, this is not the 
best solution. Rather the number of registers should be minimized by using the 
output signals as through signals in the next stage as often as possible. This can 
be achieved by using a modified retiming algorithm and a systematic distribution 
of signal type conflicts. 

4.1 Graph of the Adder Field 




Fig. 5. Graph of the adder field with fixed registers 



To perform the retiming algorithm a model of the circuit is needed. The
circuit is represented by a finite, edge-weighted directed graph G [2]. The vertices
V of the graph model the functional elements of the circuit. Each edge e ∈ E
connects an output of some functional element to an input of some functional
element and is weighted with a register count. The register count is the number
of registers in the connection. All inputs are combined in a single vertex, as
are all outputs.
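A minimal sketch of such a graph model is given below. The vertex names and the small example netlist are hypothetical and serve only to show the data structure: edges carry a register count, and all inputs and outputs are merged into the single vertices `IN` and `OUT`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str        # driving functional element
    dst: str        # driven functional element
    registers: int  # register count (edge weight)


# Hypothetical fragment of an adder field; names are illustrative only.
EDGES = [
    Edge("IN", "and0", 0),
    Edge("IN", "and1", 0),
    Edge("and0", "ha0", 0),
    Edge("and1", "ha0", 1),  # one register on this connection
    Edge("ha0", "OUT", 0),
]


def total_registers(edges):
    """Total number of registers in the circuit."""
    return sum(e.registers for e in edges)


def path_latencies(edges, source="IN", sink="OUT"):
    """Set of register counts accumulated along all source-to-sink paths.

    Assumes an acyclic graph.
    """
    outgoing = {}
    for e in edges:
        outgoing.setdefault(e.src, []).append(e)
    found = set()
    stack = [(source, 0)]
    while stack:
        vertex, latency = stack.pop()
        if vertex == sink:
            found.add(latency)
            continue
        for e in outgoing.get(vertex, []):
            stack.append((e.dst, latency + e.registers))
    return found


print(total_registers(EDGES))  # 1
print(path_latencies(EDGES))   # {0, 1}
```

The two paths of the example carry different register counts, which is the kind of imbalance that additional timing registers have to remove.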

4.2 Register Placement 

Because the resulting optimization problem is hard to solve, no global
optimization algorithm is available at the moment [3]. As a solution, the task has to
be divided into two parts: first, the assignment of the signal types, which also
defines the places of the fixed registers needed for the signal type
conversion; second, the optimal placement of the registers used to keep the
timing correct.







Fig. 6. Register placement 



Assignment of Signal Types. The assignment of signal types to the different 
inputs of a logic gate is not given. During the design of a logic gate at least one 
input can be included in the root nodes, thus it is becoming a through signal. 
There are several possibilities to choose the input included. The question is which 
one should be chosen. 

Since all delays in one path have to be included in all other paths to a certain
gate, additional latency for signal type conversion will consume a lot of registers.
Therefore the minimization of the overall latency is a reasonable
aim.

The overall latency is minimized by gradually minimizing the latency for each
vertex. As they require a register, signal type conflicts are always assigned to
the predecessor vertex with the lowest latency. Although the method is based
on a local optimization, a global minimum for the overall latency is achieved [3].
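The local rule can be illustrated with a small sketch. It rests on assumptions that go beyond the text: every input of the gate arrives as a through signal, the gate accepts exactly one through input, and every converted input costs one clock cycle for its register. The function name is hypothetical.

```python
def vertex_latency(pred_latencies):
    """Latency of a gate whose inputs all arrive as through signals.

    ASSUMPTION: one input may stay a through signal; every other input needs
    a conversion register that adds one clock cycle.
    """
    if not pred_latencies:
        return 0
    ordered = sorted(pred_latencies)
    through = ordered[-1]                          # latest input stays a through signal
    converted = [lat + 1 for lat in ordered[:-1]]  # conflicts go to the earlier inputs
    return max([through] + converted)


print(vertex_latency([0, 2]))  # 2
print(vertex_latency([1, 1]))  # 2
```

For predecessors with latencies 0 and 2, assigning the conflict to the predecessor with latency 0 gives a gate latency of 2, whereas the opposite choice would give 3.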
Optimization of the Register Distribution. Once the registers for the signal
type conversion are placed, they must not be moved any more. To place additional
registers, which ensure that the timing of the circuit is correct, a modified
retiming algorithm for state minimization [2] is used. The demand for unmovable
registers can easily be included in the constraints of the retiming algorithm.
The algorithm must have the freedom to increase the overall latency. Therefore
the problem becomes a pipelining problem.
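The role of the constraints can be sketched as follows. This is not the state-minimizing retiming algorithm of [2]; it only computes one feasible solution, assuming a register-free netlist that is retimed by a lag r per vertex so that an edge (u, v) receives r[v] - r[u] registers, with the fixed conversion registers acting as lower bounds. The example netlist and all names are hypothetical.

```python
from collections import defaultdict

# (source, sink, fixed): `fixed` is the number of unmovable conversion
# registers required on that connection. Hypothetical example netlist.
EDGES = [
    ("IN", "a", 0),
    ("IN", "b", 0),
    ("a", "c", 0),
    ("b", "c", 1),  # signal type conversion register
    ("c", "OUT", 0),
]


def asap_lags(edges):
    """Smallest lags r with r[v] - r[u] >= fixed on every edge.

    Longest-path relaxation; terminates for acyclic netlists.
    """
    r = defaultdict(int)
    changed = True
    while changed:
        changed = False
        for u, v, fixed in edges:
            if r[u] + fixed > r[v]:
                r[v] = r[u] + fixed
                changed = True
    return dict(r)


def register_counts(edges, r):
    """Registers on each edge after retiming a register-free netlist by r."""
    return {(u, v): r[v] - r[u] for u, v, _ in edges}


lags = asap_lags(EDGES)
counts = register_counts(EDGES, lags)
print(lags)
print(counts[("a", "c")])  # 1: timing register balancing the fixed one on b -> c
```

The lag of `OUT` relative to `IN` is the overall latency, which is why leaving it free turns the task into a pipelining problem. A real implementation would additionally minimize the total register count.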

5 Integration in a Design Flow 

The proposed methods for signal assignment and register distribution can easily
be integrated in a standard design flow. As an input, a complete gate netlist
is required, which can be produced by any netlist compiler, starting e.g. from
a VHDL description. The signal type assignment will then label all signals as
control or through types; the following retiming algorithm will insert additional
registers into the gate netlist. This modified netlist may then be handed on to a
place and route tool or to a logic simulator. The place and route tool, of course,
has to handle the different signal types, but this can easily be put into practice
by including signal types into the names of input signals.



6 Simulation Results 

6.1 Evaluation Criteria 

At first it seems very easy to evaluate the measures to reduce the losses by
comparing the power consumption. As soon as speed is considered, this method is
no longer meaningful. In a CMOS circuit, for example, the power loss is
proportional to the clock frequency. The product of the power loss of a circuit and
the gate delay is called the power-delay product. In a CMOS circuit it is proportional
to the average energy dissipation per arithmetic operation. Therefore it seems
to be a very reasonable measure for the reduction of power losses. On the
one hand, the dissipated energy leads to a warming up of the circuit; on the
other hand, the energy reservoir might be limited. In a CMOS architecture the
power-delay product is independent of the clock frequency. In adiabatic circuits
the dissipated energy should decrease according to $E_{diss} \sim 1/T$.



6.2 Variation of Ramping Time 

Figure 7 shows the average energy consumption over the ramping time T. All
transistors have the minimal size of l = 0.25 µm and w = 0.5 µm. The voltage has
been set to 2.5 V. The simulation results clearly show that the circuit is working
adiabatically. For small values of the ramping time T the energy dissipation is
inversely proportional to T.

Fig. 7. Energy loss of multiplier in 0.25 µm

Although the size of the adiabatic circuit is about 5 times larger than that of
the CMOS circuit, its energy loss is smaller for ramping times which are longer than
2 ns. This corresponds to a clock frequency of 120 MHz. The minimal losses are
achieved at T ≈ 1 µs. For this T the energy consumption is only 36% of that of
the CMOS circuit. If the ramping time T is longer than 10 µs, the static leakage
currents have to be taken into account, and the dissipated energy is proportional
to T.
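The two regimes described here (loss inversely proportional to T for short ramps, loss proportional to T once leakage dominates) can be combined into a simple two-term model. This is an illustrative sketch only: the constants A and B are hypothetical fit parameters and are not given in the text.

```latex
\begin{align}
E_{diss}(T) &\approx \frac{A}{T} + B\,T \\
\frac{dE_{diss}}{dT} &= -\frac{A}{T^{2}} + B = 0
  \;\Longrightarrow\; T_{opt} = \sqrt{A/B} \\
E_{diss,min} &= E_{diss}(T_{opt}) = 2\sqrt{A\,B}
\end{align}
```

Under this model the minimum loss depends only on the product of the adiabatic coefficient A and the leakage coefficient B, so a technology with larger leakage shifts the optimum to shorter ramping times and raises the achievable minimum.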

This simulation is a rather optimistic view of the adiabatic circuit. On the one 
hand, ideal ramps are used to drive the clock phases. On the other hand, losses 
in the generation of the clock phases are not considered. 



6.3 Influence of Channel Length 

Another simulation was run with scaled transistors to allow a prediction for
future process technologies.

The channel length was set to l = 1 µm and the width was doubled. This
corresponds to an older process. Although the absolute energy dissipation is much
higher because of the larger gate capacitances, the adiabatic principle works even
better. The minimum losses are reduced to about 18% at T ≈ 10 µs.

This result allows the prediction that problems will arise with further
downscaling of the transistors due to the increase of short channel leakage currents. The
minimal energy dissipated depends on the ratio of the channel resistances in the
conducting and non-conducting states, as shown in formula 1.




E, 



■smin 
Fig. 8. Energy losses of multiplier in



This ratio has increased with the downscaling of the structures due to short 
channel effects. This effect can be seen in figure 8. 

7 Conclusion 

This work could not give a final decision on whether adiabatic charging has a
chance to prevail against well established techniques, but there are
some hints which allow a prognosis of the possibilities offered.

The realisation of whole logic blocks using adiabatic switching is problematic
because of the increased number of transistors. Therefore the adiabatic circuit
has to compensate for the increased energy resulting from the transistor overhead. For
the example shown this was still possible for 0.25 µm channel length. It has to
be verified whether this is still possible in the case of smaller transistors. This might be
a problem because the short channel leakage currents increase as the structures
get smaller.

In conclusion, the use of adiabatic switching for the
design of logic is doubtful as long as MOS transistors are used.
Abstract. In this paper, properties of the Logarithmic Number System (LNS)
are investigated which can lead to power savings in a digital
system. To quantitatively establish power savings, the equivalence of an 
LNS to a linear fixed-point system is, initially, explored and a related 
theorem is introduced. It is shown that LNS leads to reduction of the 
average bit assertion probability by more than 50%, in certain cases, 
over an equivalent linear representation. Finally, the impact of LNS on 
hardware architecture and, by means of that, to power dissipation, is 
discussed. 



1 Introduction 

In recent years, power dissipated in an electronic system has evolved into
an important design issue, mainly due to the impetus offered by the need for 
portable equipment, as well as the requirement for very high-speed processors [1]. 

Power dissipation minimization is sought at all levels of design abstraction, 
ranging from software/hardware partitioning down to technology-related issues. 
The average power dissipation in a circuit is computed via the relationship

$$P_{ave} = a\, f_{clk}\, C_L\, V_{dd}^2, \qquad (1)$$

where $f_{clk}$ is the clock frequency, $C_L$ is the total switching capacitance, $V_{dd}$ is
the supply voltage, and $a$ is the average activity in a clock period.

A wide variety of design techniques have been proposed [1], aiming at
reducing the various factors of product (1). Among them, the successful selection of
the number system and the proper design of arithmetic circuits has been
proposed as a power dissipation minimization technique [2] [3], which can affect all
factors of (1) [4].

In this paper, it is shown that the adoption of the Logarithmic Number 
System (LNS) [5] can lead to substantial power dissipation savings, due to the 
reduction of average bit activity and due to the simplification of certain arith- 
metic operations achieved by its utilization. The concept of equivalence between 
LNS and linear fixed-point representation is investigated, in order to define the 
logarithmic word length and the base, as well as to provide a quantitative
performance comparison between the two representations in the context of design
for low power. Special attention is paid to equivalence, since in order to
quantify power dissipation savings over an n-bit fixed-point system, it is necessary
to derive the LNS which provides sufficient range and precision, so that the 
comparison results are meaningful from the application point of view. 

The organization of the remainder of the paper is as follows. In section 2, 
the basics of the LNS encoding are briefly reviewed and its equivalence to linear 
representations is explored. In section 3, the activity reduction made possible 
via the LNS encoding is investigated. Section 4 discusses the complexity of LNS 
operations. Finally, conclusions are offered in section 5. 



2 LNS and Equivalence to Linear Representations 



The LNS representation maps a real number $X$ to a triplet, as follows

$$X \rightarrow (z, s, x = \log_b |X|), \qquad (2)$$

where $b$ is the base of the logarithm, $z$ is the zero flag, and $s$ is the sign of $X$. A
zero flag is required, as $\log_b X$ is not a finite number for $X = 0$. Similarly, since
the logarithm of a negative number is not a real number, the sign information
of $X$ is stored in flag $s$. Logarithm $x = \log_b |X|$ is encoded as a binary number,
and it can be written as

$$x = I.F, \qquad (3)$$

where $I$ is the integer part and $F$ is the fractional part.
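The mapping (2) can be sketched in a few lines. This is an illustrative sketch only: the names `LNSWord`, `lns_encode` and `lns_decode` are hypothetical, and the logarithm is kept as an unquantized float rather than as a finite-length binary word.

```python
import math
from typing import NamedTuple


class LNSWord(NamedTuple):
    z: bool   # zero flag
    s: bool   # sign flag, True for negative X
    x: float  # log_b |X| (unquantized in this sketch)


def lns_encode(X: float, b: float = 2.0) -> LNSWord:
    """Map a real number X to the triplet (z, s, x) of eq. (2)."""
    if X == 0:
        return LNSWord(True, False, 0.0)  # x is a don't-care when z is set
    return LNSWord(False, X < 0, math.log(abs(X), b))


def lns_decode(w: LNSWord, b: float = 2.0) -> float:
    """Inverse mapping back to the linear domain."""
    if w.z:
        return 0.0
    magnitude = b ** w.x
    return -magnitude if w.s else magnitude


word = lns_encode(-8.0)
print(word.z, word.s)    # False True
print(lns_decode(word))  # approximately -8.0
```

The zero and sign flags carry exactly the information that the logarithm itself cannot represent.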

Traditionally, LNS has been considered as an alternative to floating-point 
representation [6] [7]. However, in this paper, LNS is compared to an n-bit linear 
fixed-point representation and it is shown to provide substantial improvement 
in terms of power dissipation. 

There are two main issues in a finite word length number system, namely the
range of the numbers which can be represented and the precision of the
representation [6]. The representational equivalence of an n-bit linear fixed-point system
and of an LNS needs to be investigated, as the two representations differ in both
range and precision behavior, due to the nonlinear nature of the logarithm.

Let $k$ and $l$ be integers which denote the word length of the integer and
fractional part of an LNS word, respectively. Let $(k, l, b)$-LNS denote an LNS
of integer and fractional word length $k$ and $l$, respectively, and of base $b$. The
problem of equivalence between a $(k, l, b)$-LNS and an n-bit linear fixed-point
system is to compute $k$ and $l$ in such a way that the two number representations
satisfy a suitably defined criterion, for a particular base $b$.

The relative representational error, $\epsilon_{rel}$, of a number $X$ encoded in a number
system is, in general, a function of the value $X$ and it is defined as

$$\epsilon_{rel} = \frac{|\hat{X} - X|}{X}, \qquad (4)$$

where $X$ is the actual value and $\hat{X}$ is the corresponding value representable in
the system. Notice that $\hat{X} \neq X$ due to the finite length of the words. The relative
representational error $\epsilon_{rel,LNS}$ for a $(k, l, b)$-LNS is given by (cf. [6], for the case
$b = 2$)

$$\epsilon_{rel,LNS} = b^{2^{-l}} - 1, \qquad (5)$$

while for the n-bit linear fixed-point case, the corresponding $\epsilon_{rel,FXP}$ is, due to
definition (4), given by

$$\epsilon_{rel,FXP} = \frac{1}{X}, \qquad (6)$$

where $X$ denotes an n-bit fixed-point number. From (5) and (6), it can be noticed
that $\epsilon_{rel,FXP}$ depends on $X$, while $\epsilon_{rel,LNS}$ does not. In order to overcome this
difference and be able to compare the precision of the two representations,
the following two restrictions are posed:

1. the two representations should cover equivalent data ranges and 

2. the two representations should exhibit equal average representational error. 



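The average representational error, $\epsilon_{ave}$, is defined as

$$\epsilon_{ave} = \frac{1}{X_{max} - X_{min} + 1} \sum_{X = X_{min}}^{X_{max}} \epsilon_{rel}(X), \qquad (7)$$

where $X_{min}$ and $X_{max}$ define the range of representable numbers.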

Due to definition (7), the average representational error for the fixed-point
case is given by

$$\epsilon_{ave,FXP} = \frac{1}{2^n - 1} \sum_{i=1}^{2^n - 1} \frac{1}{i}, \qquad (8)$$

which, by computing the sum on the right-hand side, can be written as

$$\epsilon_{ave,FXP} = \frac{\psi(2^n) + \gamma}{2^n - 1}, \qquad (9)$$

where $\gamma$ is the Euler gamma constant and function $\psi$ is defined through

$$\psi(x) = \frac{d}{dx} \ln \Gamma(x), \qquad (10)$$

where $\Gamma(x)$ is the Euler gamma function.

In the case of the LNS, as $\epsilon_{rel,LNS}$ is constant over the range, due to (5), it
follows that

$$\epsilon_{ave,LNS} = b^{2^{-l}} - 1. \qquad (11)$$

In the following, the maximum number representable in each number system
is computed and utilized to compare the ranges of the representations. Notice
that different figures can also be used for range comparison, such as the ratio
$X_{max}/X_{min}$.

The maximum number representable by an n-bit linear integer is $2^n - 1$;
therefore the upper bound of the fixed-point range is given by

$$X_{max}^{FXP} = 2^n - 1. \qquad (12)$$

The maximum number representable by the $(k, l, b)$-LNS encoding (2) is

$$X_{max}^{LNS} = b^{2^k + 1 - 2^{-l}}. \qquad (13)$$



Therefore, according to the equivalence criteria posed earlier, in order that an
LNS is equivalent to an n-bit linear fixed-point representation, the following
restrictions should be simultaneously satisfied:

$$X_{max}^{LNS} \geq X_{max}^{FXP} \qquad (14)$$

$$\epsilon_{ave,LNS} \leq \epsilon_{ave,FXP} \qquad (15)$$

Hence, from (9) and (11)-(13), it is obtained that

$$b^{2^k + 1 - 2^{-l}} \geq 2^n - 1 \qquad (16)$$

$$b^{2^{-l}} - 1 \leq \frac{\psi(2^n) + \gamma}{2^n - 1}, \qquad (17)$$



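which, when solved for $k$ and $l$, give

$$l = \left\lceil -\log_2 \log_b \left( 1 + \frac{\psi(2^n) + \gamma}{2^n - 1} \right) \right\rceil, \qquad (18)$$

$$k = \left\lceil \log_2 \left( \log_b (2^n - 1) + 2^{-l} - 1 \right) \right\rceil. \qquad (19)$$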



The above analysis can be summarized by introducing the following theorem. 

Theorem 1. A $(k, l, b)$-LNS covers a range at least as large as that of an n-bit
fixed-point system, with an average representational error equal to or smaller than that of
the fixed-point system, when $l$ and $k$ are given by (18) and (19), respectively.

Values of $k$ and $l$ that correspond to various values of $n$ for various values of $b$
can be seen in Table 1.

While the word lengths $k$ and $l$ computed via (18) and (19) meet the posed
equivalence specifications (14) and (15), LNS is capable of covering a significantly
larger range than the equivalent fixed-point representation. Let $n_{eq}$ denote the
word length of a fixed-point system which can cover the range offered by an LNS
defined through (18) and (19), or, equivalently, let $n_{eq}$ be the smallest integer
which satisfies

$$2^{n_{eq}} - 1 \geq b^{2^k + 1 - 2^{-l}}. \qquad (20)$$

From (20) it follows that

$$n_{eq} = \left\lceil (2^k + 1 - 2^{-l}) \log_2 b \right\rceil. \qquad (21)$$

It should be stressed that, when $n_{eq} > n$, the precision of the particular
fixed-point system is better than that of the LNS derived by (18) and (19). Equation
(21) reveals that the particular LNS, while meeting the precision of an n-bit
linear representation, in fact covers the range provided by an $n_{eq}$-bit linear
system.
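The word lengths of Theorem 1 can be evaluated numerically. The following is a minimal sketch, assuming equations (8), (18), (19) and (21) as written above; the function names are hypothetical, and the harmonic sum of (8) is evaluated directly instead of through the digamma function of (9).

```python
import math


def harmonic(m: int) -> float:
    """H_m = sum_{i=1}^{m} 1/i, which equals psi(m + 1) + gamma."""
    return sum(1.0 / i for i in range(1, m + 1))


def equivalent_lns(n: int, b: float = 2.0):
    """Return (k, l, n_eq) for an n-bit fixed-point system and base b."""
    x_max_fxp = 2**n - 1                                    # eq. (12)
    eps_fxp = harmonic(x_max_fxp) / x_max_fxp               # eqs. (8), (9)
    l = math.ceil(-math.log2(math.log(1 + eps_fxp, b)))     # eq. (18)
    k = math.ceil(math.log2(math.log(x_max_fxp, b) + 2.0**-l - 1))  # eq. (19)
    n_eq = math.ceil((2**k + 1 - 2.0**-l) * math.log2(b))   # eq. (21)
    return k, l, n_eq


k, l, n_eq = equivalent_lns(8, 2.0)
print(k, l, n_eq)  # 3 5 9
```

For n = 8 and b = 2 the sketch yields k = 3, l = 5 and n_eq = 9, so the LNS word has the same total length as the fixed-point word while covering the range of a 9-bit linear system.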




Logarithmic Number System for Low-Power Arithmetic 289 



Table 1. Correspondence of $n$, $k$, $l$, and $n_{eq}$ for various bases $b$.

3 Power Dissipation and LNS Encoding

In this section, it is shown that assuming a uniform distribution of input linear 
n-bit numbers, the distribution of bit assertions of the corresponding LNS words, 
reveals that LNS can be exploited to reduce the average activity. 

Let $p_{0\to1}(i)$ be the bit assertion probability, i.e., the probability of the $i$th
bit making a transition from 0 to 1. Assuming that data are temporally independent, it
holds that

$$p_{0\to1}(i) = p_0(i)\,p_1(i) = (1 - p_1(i))\,p_1(i), \qquad (22)$$

where $p_0(i)$ and $p_1(i)$ are the probabilities of the $i$th bit being 0 or 1, respectively.
Due to the assumption of uniform data distribution, it holds that

$$p_0(i) = p_1(i) = \frac{1}{2}, \qquad (23)$$

which, due to (22), gives

$$p_{0\to1}(i) = \frac{1}{4}. \qquad (24)$$

Therefore, all bits in the linear fixed-point representation exhibit an equal
$p_{0\to1}(i)$, $i = 0, 1, \ldots, n - 1$.

Activities of the bits in an LNS-encoded word are quantified under similar 
assumptions. Since there is an one-to-one correspondence of linear fixed-point 
values to their LNS images defined by (2), the LNS values follow a probability 
function, identical to the fixed-point case. In fact the LNS mapping can be 
considered a continuous transformation of the discrete random variable X, which 
is a word in the linear representation, to the discrete random variable x, an 
LNS word. Hence the two discrete random variables follow the same probability 
function [8]. 

However, the probabilities of bit assertions in LNS words are not
constant as $p_{0\to1}(i)$ of (24); they depend on the significance of the $i$th bit. To
evaluate the probabilities $p^{LNS}_{0\to1}(i)$, the following experiment is performed. For
all possible values of $X$ in an n-bit system, the corresponding $[\log_b X]$ values in a
$(k, l, b)$-LNS format are derived and probabilities $p_1(i)$ for each bit are computed.
Then, $p^{LNS}_{0\to1}(i)$ is computed as in (22).

The actual assertion probabilities for the bits in an LNS word, $p^{LNS}_{0\to1}(i)$, are
depicted in Fig. 1. It can be seen that $p^{LNS}_{0\to1}(i)$ for the more significant bits
is substantially lower than $p^{LNS}_{0\to1}(i)$ for the less significant bits. Also, it can be
seen that $p^{LNS}_{0\to1}(i)$ depends on $b$. This behavior, which is due to the inherent
data compression property of the logarithm function, leads to a reduction of the
average activity in the entire word. Average activity savings percentage, $S_{ave}$, is
computed as

$$S_{ave} = \left( 1 - \frac{\sum_{i=0}^{k+l-1} p^{LNS}_{0\to1}(i)}{0.25\,n} \right) 100\%, \qquad (25)$$

where it has been used that $p^{FXP}_{0\to1}(i) = 1/4$ for $i = 0, 1, \ldots, n - 1$, $n$ denotes the
length of the fixed-point system, and the word lengths $k$ and $l$ are computed
via Theorem 1. Savings percentage $S_{ave}$ is demonstrated in Fig. 2(a) for various
values of $n$ and $b$, and it is found to be more than 15% in certain cases.

However, as implied by the definition of $n_{eq}$ in (21), the linear system which
provides an equivalent range to that of a $(k, l, b)$-LNS requires $n_{eq}$ bits. If the
reduced precision of the $(k, l, b)$-LNS compared to the $n_{eq}$-bit fixed-point system is
acceptable for a particular application, $S'_{ave}$ is used to describe the relative efficiency
of LNS, instead of (25), where

$$S'_{ave} = \left( 1 - \frac{\sum_{i=0}^{k+l-1} p^{LNS}_{0\to1}(i)}{0.25\,n_{eq}} \right) 100\%. \qquad (26)$$

Savings percentage $S'_{ave}$ is demonstrated in Fig. 2(b) for various values of $n$
and $b$. Savings are found to exceed 50% in some cases. Notice that Fig. 2 reveals
that, for a particular word length $n$, the proper selection of logarithm base $b$ can
significantly affect the average activity. Therefore, the choice of $b$ is important
in designing a low-power LNS-based system.
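The experiment behind the activity figures can be sketched as follows. Several choices here are assumptions rather than the authors' setup: the logarithm is quantized by truncation to l fractional bits, X = 0 is excluded because it is handled by the zero flag, and the word width is taken from the largest code that occurs. The function names and the choice l = 5 are illustrative.

```python
import math


def lns_bit_activity(n: int, l: int, b: float = 2.0):
    """p_{0->1}(i) for every bit i of the quantized log_b X, X = 1 .. 2^n - 1.

    ASSUMPTION: truncation to l fractional bits; the small offset guards
    against floating-point results that fall just below an exact code.
    """
    codes = [math.floor(math.log(X, b) * 2**l + 1e-9) for X in range(1, 2**n)]
    width = max(codes).bit_length()
    total = len(codes)
    activity = []
    for i in range(width):  # i = 0 is the least significant bit
        p1 = sum((c >> i) & 1 for c in codes) / total
        activity.append((1 - p1) * p1)  # eq. (22)
    return activity


def savings_percent(activity, n_bits: int) -> float:
    """Savings relative to n_bits fixed-point bits that each toggle with 1/4."""
    return (1 - sum(activity) / (0.25 * n_bits)) * 100


act = lns_bit_activity(n=8, l=5)
print([round(a, 3) for a in act])
print(round(savings_percent(act, 8), 1))
```

No bit can exceed the fixed-point value of 1/4, because p(1 - p) is at most 1/4, and the most significant bit falls well below it since most linear values share the same leading bits of the logarithm.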



4 Power Dissipation and LNS Architecture

In the previous section, it has been shown that the LNS representation is bene- 
ficial over the fixed-point representation in terms of the average bit activity. In 
this section, the impact of LNS on the architecture is discussed. 

LNS exploits properties of the logarithm to reduce the strength of several 
arithmetic operations, thus it leads to complexity savings. By reducing the area 
complexity of operations, the switching capacitance $C_L$ of (1) can be reduced.
Furthermore, reduction in latency allows for further reduction in supply voltage, 
which also reduces power dissipation [1]. 

Let $x$ and $y$ be the $(k, l, b)$-LNS images of the linear quantities $X$ and $Y$. The
transformation of operations is summarized in Table 2. Table 2 shows that n-bit
multiplication and division are reduced to $(k + l)$-bit addition and subtraction,
respectively, while the computation of roots and powers is reduced to division
and multiplication by a constant, respectively. For the common cases of square
root or square, the operation is reduced to a right or left shift, respectively. For
example, assume that an n-bit carry-save array multiplier, which has a complexity
of $n^2 - n$ 1-bit full adders (FAs), is replaced by an n-bit adder, which, assuming
$k + l = n$, has a complexity of $n$ FAs for a ripple-carry implementation [6].
Therefore, multiplication complexity is reduced by a factor $r_{C_L}$ given as

$$r_{C_L} = \frac{n^2 - n}{n} = n - 1. \qquad (27)$$

Equation (27) reveals that the reduction factor $r_{C_L}$ grows with the word length $n$.

Fig. 1. Activities against bit significance i, in an LNS word, for n = 8 (a) and n = 12
(b) and various values of the base b. The horizontal dashed line is the activity of the
corresponding n-bit fixed-point system.

However, addition and subtraction are complicated in LNS, since they
require a table look-up operation for the evaluation of $\log_b(1 \pm b^{y-x})$, although
different approaches have been proposed in the literature [9] [10]. A table look-up
operation requires a ROM of $n \times 2^n$ bits, a size which can inhibit LNS
utilization for large values of $n$. In an attempt to solve this problem, efficient table
reduction techniques have been proposed [11]. As a result of the above analysis,
applications with a computational load dominated by operations of simple LNS
implementation can be expected to gain power dissipation reduction due to the
LNS impact on architecture complexity.

Fig. 2. Percentage of average activity reduction due to the use of LNS, compared
to n-bit (a) and to $n_{eq}$-bit (b) linear fixed-point system, for various bases b of the
logarithm. The diagram reveals that the optimal selection of b depends on n and it can
lead to significant power dissipation reduction.

Finally, it should be noted that overhead is imposed for linear-to-logarithmic
and logarithmic-to-linear conversion. Conversion overhead contributes additional
area and time complexity as well as power dissipation. However, as the number
of operations grows, the conversion overhead remains constant; therefore its
contribution to the overall budget becomes negligible.
Table 2. Impact of LNS on arithmetic operations.

Operation      Linear                                          Logarithmic
multiply       $Z = XY = b^x b^y = b^{x+y}$                    $z = \log_b Z = x + y$
divide         $Z = X/Y = b^x / b^y = b^{x-y}$                 $z = x - y$
root           $Z = X^{1/m} = b^{x/m}$                         $z = x/m$, $m$ integer
power          $Z = X^m = b^{mx}$                              $z = mx$, $m$ integer
addition       $Z = X + Y = b^x + b^y = b^x (1 + b^{y-x})$     $z = x + \log_b(1 + b^{y-x})$
subtraction    $Z = X - Y = b^x - b^y = b^x (1 - b^{y-x})$     $z = x + \log_b(1 - b^{y-x})$
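The operations of Table 2 can be written out directly on the logarithms. This is a floating-point sketch under stated assumptions: operands are positive (the sign and zero flags are ignored), the base B = 2 is an arbitrary choice, and the term log_b(1 ± b^(y-x)) is computed exactly instead of being read from a look-up table. All function names are hypothetical.

```python
import math

B = 2.0  # logarithm base; an arbitrary choice for this sketch


def lns_mul(x, y):
    return x + y          # z = x + y


def lns_div(x, y):
    return x - y          # z = x - y


def lns_root(x, m):
    return x / m          # m-th root; a right shift for m = 2


def lns_pow(x, m):
    return m * x          # m-th power; a left shift for m = 2


def lns_add(x, y, b=B):
    return x + math.log(1 + b ** (y - x), b)


def lns_sub(x, y, b=B):
    # Requires X > Y, i.e. x > y, so that the argument of the log is positive.
    return x + math.log(1 - b ** (y - x), b)


x, y = math.log2(8.0), math.log2(2.0)  # X = 8, Y = 2
print(B ** lns_mul(x, y))  # approximately 16
print(B ** lns_add(x, y))  # approximately 10
print(B ** lns_sub(x, y))  # approximately 6
```

Multiplication, division, roots and powers need only an adder or a shifter, while addition and subtraction carry the cost of evaluating the nonlinear correction term.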



5 Conclusions 

The impact of LNS on the power dissipation of a digital system which performs
arithmetic operations has been investigated. The discussion is based on proposed
conditions of equivalence between an LNS and fixed-point representations. It is
shown that LNS can lead to significant average bit activity reduction. It has been
found that the efficiency of the LNS representation is dominated by the choice
of the word lengths $k$ and $l$ and of the often neglected parameter $b$. Furthermore,
the impact of LNS on architecture has been briefly discussed to show that
architecture simplification is also possible, in certain cases.

LNS, with a combined exploitation of savings in signal activity and of savings 
due to architectural simplification for suitable applications, can be a successful 
candidate for the implementation of future low-power computationally-intensive 
systems. 
Abstract. This paper presents an application where a self-timed approach
reduces the switching noise in a mixed analog-digital circuit. Switching noise is
of major concern in mixed-signal systems, since it limits the performance of
the analog part. Specifically, the digital core of an analog-to-digital converter has
been designed following both a synchronous and a self-timed design style.
Comparison between the two versions shows that the self-timed implementation reduces
the switching noise by up to 50% with respect to the synchronous implementation.



1 Introduction 

CMOS integrated circuits for mixed analog-digital systems are increasing in interest
and importance. There is a continuous trend toward high-frequency, high-resolution,
low-power and low-voltage analog circuitry sharing a common substrate with complex
high-performance digital circuitry. However, because digital switching noise
adversely affects sensitive analog circuitry via substrate coupling, it is difficult
to realize high-resolution analog circuits on the same substrate as digital
circuitry [1], [2]. Some techniques exist to reduce this noise from the “analog”
point of view [3]. Only recently has this problem been considered from the digital
domain [1], [2], [4], [5], with the aim of designing low-switching-noise digital
families.

Switching noise is produced by the variation in the supply current due to
transitions of digital signals. These variations can affect the analog circuitry
via substrate coupling, degrading its performance and even causing transient and
permanent operating errors. One way of measuring this noise is to monitor the
supply current, since its variation from the average level is directly
proportional to the noise [1]. Therefore, we use the maximum variation of the
supply current as an indirect measurement of the switching noise.
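A minimal sketch of this measurement, assuming the supply current is available as a list of simulated or sampled values: the metric is the largest absolute deviation from the average level. The function name and the sample trace are hypothetical and only illustrate the idea.

```python
from statistics import fmean

def switching_noise_metric(current_samples):
    """Largest absolute deviation of the supply current from its average level."""
    average = fmean(current_samples)
    return max(abs(i - average) for i in current_samples)

# Hypothetical supply-current trace in mA: a quiet baseline with one sharp peak.
trace_mA = [1.0, 1.0, 1.0, 5.0]
print(switching_noise_metric(trace_mA))  # average is 2.0 mA, so the metric is 3.0
```

Two implementations that draw the same average current can differ widely under this metric, which is why a design that spreads its activity over time scores better.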

The self-timed approach can be seen as an advantageous alternative to synchronous
circuits in this type of application [6]. On the one hand, self-timed cells decide
for themselves when they need to operate, without a global clock, so it is easy to
avoid operation when it is unnecessary. On the other hand, the operation of the
different blocks is not synchronized, so current consumption is not simultaneous
but distributed over time, and the switching noise decreases. The self-timed
design presented in this paper, the digital core of an A/D converter, has been
realized with the structures introduced in [7]. These structures use a
half-handshaking protocol and the so-called SODS-QF structure. The main advantage
of this structure is that it does not need any memory element to solve the early
precharge problem [7], yielding a structure with less area and a further reduction
of switching noise, since memory elements are among the main generators of this
noise.

This paper is organized as follows. Section 2 presents the main characteristics of
the application to be implemented, the digital core of an A/D converter. Section 3
deals with the synchronous design of the circuit. Section 4 describes the
self-timed design. Section 5 presents some results and a comparison between both
implementations. Finally, Section 6 gives the main conclusions.

2 A Pipelined A/D Converter as Example of Mixed-Signal Circuit

A general scheme of a pipeline ADC is shown in fig. 1. It is composed of k stages
connected in series, each one contributing a certain number of bits, n_i, to the
output code. The i-th converter stage comprises an n_i-bit sub-ADC, an n_i-bit
sub-DAC, and a residue interstage amplifier with a gain G_i that depends on the
stage resolution. The output y_{i+1} of this stage is known as the residue and is
the input of the next stage in the cascade. In many practical realizations, both
the sub-DAC and the residue amplifier in each stage are implemented by a single
circuit known as an MDAC (multiplying digital-to-analog converter). Calibration and
correction techniques are usually included to reduce the effects of component
mismatches, gain errors and nonidealities in high-speed/high-resolution converters.
Calibration is mainly aimed at reducing effects caused by component mismatches and
gain errors in the stage MDACs, and it is necessary in converters with more than
10 bits of effective resolution. The goal of digital correction, on the other hand,
is to eliminate the effects that nonidealities in the sub-ADCs have on the overall
converter operation.
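The stage-by-stage operation can be illustrated with an idealized behavioural sketch. It is not the converter of this paper: it assumes perfect sub-ADCs and sub-DACs, an interstage gain G_i = 2^(n_i), an input normalized to [0, 1), and no redundancy, correction or calibration. The function name is hypothetical, and exact fractions are used to avoid rounding effects.

```python
from fractions import Fraction

def ideal_pipeline_adc(y, bits_per_stage):
    """Convert y in [0, 1) with an ideal pipeline; returns (stage codes, output code)."""
    codes = []
    residue = Fraction(y)
    for n_i in bits_per_stage:
        gain = 2 ** n_i                    # assumed interstage gain G_i = 2**n_i
        code = int(residue * gain)         # ideal n_i-bit sub-ADC decision
        residue = residue * gain - code    # sub-DAC subtraction and amplification: y_{i+1}
        codes.append(code)
    output = 0
    for code, n_i in zip(codes, bits_per_stage):
        output = (output << n_i) | code    # concatenate the subcodes, first stage is MSB
    return codes, output

codes, output = ideal_pipeline_adc(Fraction(5, 8), [2, 1])
print(codes, output)   # [2, 1] 5, i.e. 5/8 quantized to 3 bits
```

In a real converter the subcodes of different stages refer to samples taken at different times, which is why the digital part needs a synchronization block before the subcodes are combined.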

We have designed the digital part of a pipeline A/D converter including
self-correction and self-calibration techniques. In particular, the case chosen
corresponds to the prototype reported in [8], which also includes Design for Test
strategies.

From a mixed-signal point of view, a pipelined A/D converter has an analog part
(named APB in fig. 1), which performs the analog-to-digital conversion, and a
digital part (named DCAD in fig. 1), which performs subcode synchronization,
correction, calibration and control, and must support different operation modes.
Basically, the synchronization block is a variable-length FIFO array; the
correction logic is a set of arithmetic operators; the calibration logic is a
finite state machine with arithmetic logic and RAM memories to store the error
codes after calibration; and the control block, clock generation and test pattern
generators for the analog part are provided by finite state machines. More details
of the particular architecture of the DCAD can be found in [9].

The DCAD block in fig. 1 has been designed following a synchronous strategy
(Section 3) and a self-timed approach (Section 4), in order to compare them,
especially in aspects regarding the generation of digital switching noise. We have
selected as an example a 10-bit, 10 Msamples/s pipelined ADC with 6 stages of 2 or
3 bits of resolution (programmable) and 1 stage of 1-bit resolution, with test,
self-calibration and self-correction capability.

Fig. 1. Basic structural representation of a pipelined ADC with digital
self-calibration/correction capabilities

3 A Synchronous Design of the Digital Part of the A/D Converter 

The synchronous design of the DCAD block in fig. 1 has been realized following a
classical top-down methodology, using the automatic synthesis tools integrated in
Mentor Graphics. The starting point of the design flow is a VHDL description of the
circuits at RT level. This description has been satisfactorily verified following
the verification methodology proposed in [9].

The scheme shown in fig. 2 only includes the blocks corresponding to subcode
synchronization and correction. We only show these blocks because they contain the
main differences with respect to the self-timed implementation, as will be seen in
the next section. The synchronization block includes FIFO registers, while the
correction block is a cascade of correction cells, called CfR. There is also a cell
called CfR_h with a double functionality: performing the correction of the subcodes
and generating an address for the RAM memories for calibration purposes. Once the
verification was completed, the tool automatically generated the gate-level
netlists.

In this implementation, the clocking scheme is a major concern. The scheme uses
three clock signals: ex1, ex2 and ckb. Clocks ex1 and ex2 are used both to control
the pipeline operation of the analog blocks (sampling the analog input and
providing the digital output) and to synchronize the subcodes provided by the APB.
The signal ex2 can be regarded as the complement of ex1; it is necessary because
of the way the converter operates, since processing takes place on both edges of
ex1. The ckb signal is used for controlling the calibration, test and control
blocks. It is a shifted version of ex1 and is needed for processing the code
coming from the correction block. More details about the synchronous
implementation can be found in [9].



Fig. 2. Scheme of the synchronous implementation at a block level (stages 1 to 7).



4 A Self-Timed Design of the Digital Part of the A/D Converter 

When designing self-timed systems, one of the main problems is the overhead in
hardware resources. Taking this into account, we have implemented with self-timed
techniques only those blocks that benefit most from the self-timed philosophy.
These advantages are maximized when the operation depends strongly on the input
data [10]. We have therefore selected the subcode synchronization and correction
blocks to be implemented with a self-timed approach [11].

The inclusion of a self-timed circuit between two clocked systems (the APB and the
part of the digital block other than synchronization and correction) requires
strict compatibility between the internal self-timed protocol signals and the
clock signals of the rest of the circuit (ex1, ex2 and ckb). Signal ex1 is used to
validate the data coming from the APB, while the output of the correction logic is
captured by ckb. We must also implement an interface between the synchronous and
self-timed worlds. This interface is a set of flip-flop cells, named bistable in
fig. 3. The control signal of these flip-flops is the ex1 signal, which indicates
the moment when the output codes are valid.

The synchronization block, consisting of a self-timed FIFO array, ensures that the
output code is generated with correct data, that is, with subcodes corresponding to
the same analog input value. In the self-timed implementation, we must force all
input data to be valid by delaying the enabling signal (ex1) by half a period in
the shift register of each stage. For this reason, we must add a new cell, called
init. Because of the programmability of the converter and the need to add the init
cell, we can give it a further function: determining whether a specific self-timed
register column needs to operate. Thus, we can reduce the switching noise and the
power consumption by avoiding unnecessary operations in idle stages.

Because the number of output bits from the analog cells is programmable (2 or 3),
we must add a new cell, called cod_gen. In order to minimize the hardware
resources, we have replaced the last register (Reg3) with this new cell in every
FIFO. For calibration purposes, the input data of the two most significant
correction blocks must be specifically provided by cod_gen cells.



Fig. 3. Scheme of self-timed implementation at a block level (stages 1 to 7). The
legend distinguishes protocol signals from data signals; the outputs go to the
calibration, test, control and clock blocks.

4.1 Set of Self-Timed Cells

The cells composing the self-timed block are init, reg, cod_gen, CfR and bistable,
connected as shown in fig. 3.

The cell init has two functions: delaying the ex1 signal from the previous stage
and determining when the current stage must operate. The operation set for init is
(eqs. 1-4):



A = reset(clk_in + A) (1) 

e = Z(5,. + C)+L5,._i + 2o (2) 

R_in = clk_in · Q (3)

clk_out = A · clk_in (4)



where signal A is an internal signal detecting the arrival of the first rising
edge of ex1, signal Q identifies the operation status of the current stage, signal
R_in is the input request to this self-timed stage and signal clk_out is the
output request for the following stage.

The cell reg temporarily latches the data coming from the analog part (APB), in
order to synchronize the subcodes. In the APB block, there are two kinds of cells
generating subcodes, STG and ADCk (fig. 1), so there are two kinds of reg cells,
depending on the number of bits they have to latch. The cell STG has two or three
output bits, so the reg cell connected to it has to latch three bits, whereas the
cell ADCk has one output bit, so its reg cell has to latch only one bit.

The cell cod_gen generates the input codes to the correction block. The operation
set performed by these cells is (eqs. 5-7):




Aq = b^ib^+df) 


(5) 


hi = bid. + b2d. 


(6) 


hi = 


(7) 



where d_i is the control signal determining the number of bits generated by the
cells STG, b_i is the subcode generated by the STG and h_i is the input to the
correction block.

The cell CfR corrects the subcode b_i generated by the analog part, and so it is
the core of the correction block. The operation set is (eqs. 8-10):

O_0 = car_in ⊕ h_0 (8)

O_1 = h_1 ⊕ (car_in · h_0) (9)

O 2 = h2d- + h^d- + car_inhQ{d- + hf) (10) 

where h_i and d_i have the same meaning as for the cell cod_gen, and car_in is the
output O_2 of the CfR of the preceding stage. In the case of the first CfR in the
chain, this input is the bit latched in the last reg.

In order to synchronize the output of the self-timed blocks with the synchronous
part of the DCAD block, we have used the bistable cell. The control signal of these
flip-flops is the local validation signal from the previous stage. We have ensured
that there is no setup time violation in the flip-flop of the bistable cell by
including additional delay. In addition, the non-valid precharge data are filtered
and do not pass to the synchronous block.

4.2 Design Process of the Self-Timed Implementation 

The design process has taken four phases. The first was a verification of the
self-timed circuit both at a functional level, using VHDL, and at an electrical
level, with HSPICE.

The high-level description includes behavioural modeling of the cells. The
verification at this level has been carried out with an extensive set of input
patterns. The output patterns have been used as input patterns to the remaining
synchronous blocks, while the global converter has been verified following the
methodology presented in [9].

Once the circuit had been verified at the high-level description, we validated the
design with HSPICE. Again, we have only verified the behaviour, since the
characterization will be carried out via layout extraction and with the integrated
prototype. The results of these simulations have shown correct operation of the
global circuit.

The self-timed cells have been laid out in a full-custom style using MAGIC. The
technology used was 0.6 µm CMOS with a double metal layer. Our strategy was to draw
all cell layouts with the same length or width, so that the different blocks could
be assembled in rows. Table 1 shows the size of each cell. The whole self-timed
system results from assembling the cells according to the schematic of fig. 3. The
global circuit has been simulated with HSPICE, including parasitic effects, to
verify its correct behaviour.



Table 1. Size of the self-timed cells

Cell        Size (µm x µm)
init        39 x 38
reg_1       90 x 28
reg_3       71 x 50
cod_gen     75 x 50
CfR         105 x 60
bistable    85 x 18



5 Implementation, Simulation Results and Comparison 



A synchronous and a self-timed version of the DCAD block have been integrated in
a 0.6 µm CMOS technology with a double metal layer. The microphotographs of both
implementations are shown in fig. 4. The self-timed block (synchronization and
correction), which is highlighted, occupies 7% of the total area of the self-timed
DCAD and has 3375 transistors. The total core area is 2016 µm x 1689 µm for the
synchronous version and 2012 µm x 1936 µm for the self-timed one. Thus, the area
overhead of the self-timed implementation is about 14% when compared to the
synchronous one.

Fig. 4. The microphotographs corresponding to (a) synchronous implementation and
(b) self-timed implementation of the DCAD block.
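The quoted overhead follows directly from the two core areas. The short check below uses only the dimensions given in the text; the variable names are illustrative.

```python
sync_area = 2016 * 1689          # synchronous core area in square micrometres
self_timed_area = 2012 * 1936    # self-timed core area in square micrometres

overhead = self_timed_area / sync_area - 1
print(f"{overhead:.1%}")         # 14.4%
```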

A comparison in terms of speed has not been performed, since the analog part is
slower than the digital part regardless of the implementation scheme used for the
digital part. Generally, analog circuits are slower than digital ones, so the speed
of the digital part is not significant in most mixed-signal circuits, including our
case study (the APB runs at 10 Msamples/s, which is easily reached by the DCAD,
able to work at up to 100 MHz).

To make a comparison in terms of power consumption, we have only considered the
synchronization and correction blocks, because they are the only difference between
the synchronous and the self-timed implementations. We assume that the other blocks
have a similar power consumption. Table 2 shows the power consumption in three
cases: the synchronous implementation, the self-timed implementation with all the
stages operating, and the self-timed implementation with only four stages
operating. The higher average power consumption of the self-timed version is due to
the static consumption of the SODS-QF structures during the early precharge phase,
as well as to the additional hardware.

Since the switching noise is a limiting factor in mixed-signal circuits, we have de- 
voted great efforts to make a fair comparison. We have obtained the minimum (nega- 
tive) value of supply current as a direct measurement of switching noise for the self- 
timed block and the synchronous counterpart. 

In fig. 5, we show the waveform of the supply current for the synchronous block.
The maximum value exceeds 40 mA (45.8 mA, to be precise). As all operations take
place at the transitions of the clock signals, ex1 or ex2, the peaks are very
narrow. The current peak, and hence the switching noise, is therefore high.

Table 2. Measurements related to power consumption of the synchronization and
correction blocks.

Power consumption   Synchronous   Self-timed            Self-timed
                                  (all FIFOs working)   (four FIFOs working)
Average (mW)        18.8          37.4                  16.5
Maximum (mW)        219           118                   86.8

Fig. 5. Supply current corresponding to the synchronous implementation of the
synchronization and correction blocks.

In fig. 6, we can see this measurement for the self-timed block when all stages
operate, with a maximum value of 23.7 mA. When only four stages operate, this value
is reduced by about 36%, because the first three stages do not perform any
operation and therefore do not draw any supply current. We can also see that the
current peaks are wider than in the synchronous implementation. This means that the
operation is less concentrated in time and the different blocks do not need supply
current simultaneously, so the maximum value of these peaks is lower than in the
synchronous case. We can also observe a small static power consumption, due to the
operation of the self-timed cells during early precharge.

As a final comparison in terms of switching noise, the self-timed implementation
behaves better than the synchronous implementation: its value is approximately 50%
of the synchronous measurement when all stages have to operate. When not all stages
have to operate, this difference will be greater, because the synchronous value
will hold while the self-timed value will decrease.
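The ratios behind this comparison can be reproduced from the reported peak currents. One reading is assumed here and is not stated explicitly in the text: the 36% reduction for four operating stages is taken relative to the self-timed peak with all stages operating.

```python
sync_peak_mA = 45.8          # synchronous block (fig. 5)
self_timed_peak_mA = 23.7    # self-timed block, all stages operating (fig. 6)

ratio_all = self_timed_peak_mA / sync_peak_mA

# Assumption: the 36% reduction applies to the all-stage self-timed peak.
four_stage_peak_mA = self_timed_peak_mA * (1 - 0.36)
ratio_four = four_stage_peak_mA / sync_peak_mA

print(f"all stages:  {ratio_all:.2f} of the synchronous peak")
print(f"four stages: {ratio_four:.2f} of the synchronous peak")
```

Under that assumption the self-timed peak is roughly half of the synchronous one with all stages active and roughly one third with four stages active.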

Fig. 6. Supply current corresponding to the self-timed implementation of the
synchronization and correction block when all stages operate.

6 Conclusions

In this paper, we have presented the implementation of the subcode synchronization
and correction logic of the digital part of a pipelined A/D converter using two
design techniques, one synchronous and the other self-timed. One of the main
objectives of our work is to compare the most significant parameters, such as
area, speed, power consumption and, mainly, digital switching noise.

According to the parameters obtained, we can conclude that both implementations
have quite similar characteristics. However, in mixed-signal analog-digital
circuits, the most restrictive parameter is the switching noise and, considering
this parameter, the best implementation is the self-timed one.
Abstract. In integrated mixed-signal circuits, signal integrity is affected by
parasitic substrate coupling. Therefore, substrate crosstalk analysis has to be 
performed in layout verification. The PARasitic COUpling Model GeneratoR 
for Substrate (PARCOURS) applies a three-dimensional model for the substrate 
considering conductivity and permittivity if required. As a remarkable feature 
PARCOURS uses different levels of accuracy. The highest level integrates 
circuit elements with multiple substrate terminals in order to model the flow of 
parasitic currents in the vicinity of the die surface. The lowest level simplifies 
the substrate terminal as a point connection. A commercial videochip has been 
examined with the introduced approach. 



1 Introduction 

Three important factors impact the performance of today’s mixed-signal integrated
circuits with respect to substrate coupling. The decrease of feature size results
in tighter coupling due to closer proximity. The increase of operation speed of
digital parts leads to more noise spread into the substrate, and the decrease of
the signal-to-noise ratio leads to a higher sensitivity to disturbances. This is
why several recent publications focus on substrate coupling [1-11]. They can be
divided into
experimental [2,3,4], finite-element methods (FEM) [1,6,9,10] and boundary element 
methods (BEM) [5,7,8,11]. Most of them discuss operation frequencies below 1 GHz. 
Only [10] shows results for operation up to 40 GHz. The output of most of the 
discussed algorithms is an electrical network representing the substrate as purely 
resistive. For operation in the GHz range this is no longer valid. In order to handle the 
complexity of large circuits simplifications are used. The most important 
simplification is to treat the substrate as a semi-conducting semi-space with a flat 
surface. The devices contact the substrate through a conducting layer on top of the 
surface. All BEM-approaches make use of this simplification. The FEM-approaches 
use more complex three-dimensional models. Common to all discussed approaches is 
that the substrate space is modeled as a stratified medium composed of several 
homogeneous layers, which are characterized by their conductivity and permittivity, 
respectively. Our approach is able to deal with rather small (mostly analog) circuits in 
a complex technology (i.e. Bipolar or BiMOS) which need three-dimensional model- 
ing and large (digital) circuits in a simpler technology (e.g. CMOS, TL) that are 
complex due to the number of involved elements. 
Fig. 1. Simplified Substrate Model



2 Discretization 

PARCOURS uses FEM-discretization. The modelling procedure uses a stratified sub- 
strate model corresponding to the substrate doping profile. The layout data defines the 
topology of the mesh for each layer. Several layers form the 3D-substrate region and 
are stacked vertically. In order to reduce the complexity of the mesh and consequently 
of the derived electrical network a non-rectangular gridding algorithm is applied. The 
applied method is called Voronoi Tessellation and was first published in [6]. 



2.1 Voronoi Tessellation 

Every object that is connected to the substrate like transistors (MOS and Bipolar), 
guard-rings, wells, or tie-downs leads to geometrical points on the surface of the top 
nodeplane. 




Fig. 2. Voronoi Diagram and Delaunay Triangulation 

These points are generators for the Voronoi mesh. Algorithms for Voronoi
Tessellation have a worst-case time complexity of O(N log N), with N being the
number of generators [12]. We chose an insertion algorithm that builds the Voronoi
Diagram by inserting the generators in turn. The advantage of this algorithm is
that the mesh can be extended or changed afterwards. It is quite simple to add
another generator (e.g. corresponding to an additional substrate tie-down) by
locally changing the former mesh. The runtime-critical step is the search for the
nearest neighbor among the already inserted generators. In order to accelerate the
search, the generators are organized in a quaternary tree whose branches correspond
to layout areas.




Fig. 3. Quaternary Tree Structure



Fig. 3 shows a layout with some components and the corresponding quaternary tree
for some of the components. The components are stored in the leaves of the branches
(e.g. A and B in Fig. 3). Some leaves remain empty (e.g. I1 and I2). The Voronoi
Tessellation leads to tiles which are used to build up the electric mesh following
the box-integration method.
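The role of the tree in the nearest-neighbor search can be sketched with a generic point quadtree. This is an illustration of the technique under stated assumptions, not the PARCOURS data structure: the class name, the leaf capacity of one generator, the square layout region and the requirement that all generators lie inside that region are choices made for this example.

```python
import math

class QuadTree:
    """Point quadtree over the square region [x0, x0+size] x [y0, y0+size]."""

    def __init__(self, x0, y0, size, capacity=1):
        self.x0, self.y0, self.size = x0, y0, size
        self.capacity = capacity
        self.points = []       # generators stored in this leaf
        self.children = None   # four sub-quadrants once the leaf has been split

    def insert(self, p):
        if self.children is None:
            self.points.append(p)
            # Split a full leaf; the size guard stops endless splitting on duplicates.
            if len(self.points) > self.capacity and self.size > 1e-9:
                self._split()
            return
        self._child_for(p).insert(p)

    def _split(self):
        half = self.size / 2
        self.children = [
            QuadTree(self.x0 + dx * half, self.y0 + dy * half, half, self.capacity)
            for dy in (0, 1) for dx in (0, 1)
        ]
        pts, self.points = self.points, []
        for q in pts:
            self._child_for(q).insert(q)

    def _child_for(self, p):
        half = self.size / 2
        ix = 1 if p[0] >= self.x0 + half else 0
        iy = 1 if p[1] >= self.y0 + half else 0
        return self.children[2 * iy + ix]

    def _min_dist(self, p):
        """Distance from p to this node's square (0 if p lies inside it)."""
        dx = max(self.x0 - p[0], 0.0, p[0] - (self.x0 + self.size))
        dy = max(self.y0 - p[1], 0.0, p[1] - (self.y0 + self.size))
        return math.hypot(dx, dy)

    def nearest(self, p, best=None):
        """Return (distance, point) for the stored generator closest to p."""
        if best is None:
            best = (math.inf, None)
        if self._min_dist(p) >= best[0]:
            return best                    # this quadrant cannot hold a closer point
        if self.children is None:
            for q in self.points:
                d = math.dist(p, q)
                if d < best[0]:
                    best = (d, q)
            return best
        # Visit the closest quadrants first so that pruning becomes effective early.
        for child in sorted(self.children, key=lambda c: c._min_dist(p)):
            best = child.nearest(p, best)
        return best

tree = QuadTree(0.0, 0.0, 100.0)
for generator in [(10.0, 10.0), (80.0, 20.0), (55.0, 60.0), (30.0, 90.0)]:
    tree.insert(generator)
print(tree.nearest((12.0, 11.0))[1])   # (10.0, 10.0)
```

Because whole quadrants are skipped when they cannot contain a closer generator, the search usually touches only a few leaves instead of all N generators, which is what makes incremental insertion affordable.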



2.2 Box-Integration Method 

We assume that the electric field is homogeneous within a tile. Furthermore, the
conductivity σ is assumed to be constant within a tile.




Fig. 4. Box-Integration Method 



The resistance between two nodes is calculated using formula (1). 



$$R_{AB} = \frac{d}{\sigma \cdot w \cdot h} \qquad (1)$$



For high-frequency applications such as [10] it is necessary to take permittivity
into account. We therefore assume that the permittivity ε is also constant within
the tile. Then the capacitance can be calculated from (2).

To abide by our first assumption, the homogeneous electric field inside a tile,
additional generators are added to the Voronoi Diagram in zones where generator
density is low.
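A small numeric sketch of the box-integration elements follows. The resistance uses formula (1). For the capacitance, the parallel-plate counterpart C_AB = ε·w·h/d is assumed here, since it follows from the same homogeneous-field assumption. The geometric reading (d as the node distance, w and h as the width and height of the face shared by the two tiles), the function names and the example values are also assumptions for illustration.

```python
import math

EPS0 = 8.8541878128e-12   # vacuum permittivity in F/m

def tile_resistance(d, w, h, sigma):
    """R_AB = d / (sigma * w * h), formula (1); all quantities in SI units."""
    return d / (sigma * w * h)

def tile_capacitance(d, w, h, eps_r):
    """Assumed parallel-plate form: C_AB = eps0 * eps_r * w * h / d."""
    return EPS0 * eps_r * w * h / d

sigma = 50.0                    # S/m, i.e. 1/(2 ohm cm)
d, w, h = 10e-6, 5e-6, 2e-6     # hypothetical tile geometry in metres
eps_r = 11.7                    # relative permittivity of silicon

R = tile_resistance(d, w, h, sigma)
C = tile_capacitance(d, w, h, eps_r)
print(R)        # about 20 kOhm
print(R * C)    # time constant; equals eps/sigma regardless of the geometry
```

With these two forms the product R·C depends only on the material (ε/σ), so the frequency at which the capacitive branch starts to matter is set by the layer doping rather than by the mesh.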



3 Runlevel Concept 

The presented system is the first approach that is capable of handling both
small-sized layouts with complex (analog) circuitry and rather large circuits with
many components. The complex analog circuit requires a very detailed extraction,
whereas for large circuits a rough extraction is adequate, provided it is accurate
enough to simulate the main sources of noise.



RL 1: 2D substrate, unstructured CE, point connections, planarized surface
RL 2: area connections
RL 3: 3D substrate
RL 4: no planarization
RL 5: structured circuit elements (CE)

Fig. 5. Overview of Available Runlevels



PARCOURS can be used with several runlevels. The complexity of the algorithm and
the necessary information rise with higher runlevels (Fig. 5). “CE” stands for
“circuit element”: a transistor, a resistor, a capacitor, or a well containing
several MOS transistors.



Fig. 6. Runlevel 1 and 2



Runlevel 1 (RL 1) starts with a rather rough approach, assuming a substrate with
only one layer. The surface of the substrate is planarized. The circuit elements
are connected to the substrate mesh by only one point connection. The resulting
network is small and adequate for a fast extraction of a large number of MOS
transistors in a digital application. Runlevel 2 (RL 2) uses rectangular shapes as
connecting windows to the substrate. For each well and each device on the
substrate, several Voronoi generators form the rectangular shape (Fig. 7). The
accuracy of the electrical substrate network is higher, but the network itself is
larger.

310 A. Hermann et al.



Fig. 7. Rectangular Interconnects and Corresponding Voronoi Tessellation (axes: x, y)



Runlevel 3 (RL 3) uses the stratified substrate with several nodeplanes. The
surface is still planarized; Fig. 1 gives an example. The next runlevel (RL 4)
uses a real three-dimensional substrate model (Fig. 8).

Fig. 8. Runlevel 4

“Substrate” denotes the planar substrate model of runlevel 3. “EPI” stands for an
epitaxial layer; “ISO” stands for another layer in the element region, e.g. a
guard-ring diffusion or a trench structure. Note that the existence of an epitaxial
layer is only an option. The circuit elements (CE) normally interact with the
substrate via depletion layer capacitances. Apart from wells containing several
transistors, all circuit elements model this depletion layer capacitance
themselves. All Voronoi generators adjacent to such a well are connected to the
same substrate node of the involved circuit element. Fig. 9 shows the main feature
of runlevel 5 (RL 5): for a selected number of circuit devices, special models are
applied to model the geometrical structure of the device. Therefore, they are
provided with multiple substrate terminals.

Fig. 9. Runlevel 5 Uses Directed Substrate Connections
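The runlevel hierarchy can be summarized as a set of incremental option changes. The sketch below assumes that each runlevel inherits the settings of the previous one and changes only the feature listed for it in the runlevel overview. The option names and the function are hypothetical and do not reflect the actual PARCOURS control statements.

```python
# Feature changed at each runlevel, following the runlevel overview (Fig. 5).
RUNLEVEL_CHANGES = {
    1: {"substrate": "2D", "connections": "point",
        "planarized": True, "structured_ce": False},
    2: {"connections": "area"},
    3: {"substrate": "3D"},
    4: {"planarized": False},
    5: {"structured_ce": True},
}

def runlevel_options(level):
    """Accumulate the option changes of runlevels 1..level (assumed to be cumulative)."""
    if level not in RUNLEVEL_CHANGES:
        raise ValueError(f"unknown runlevel: {level}")
    options = {}
    for rl in range(1, level + 1):
        options.update(RUNLEVEL_CHANGES[rl])
    return options

print(runlevel_options(3))
```

A large digital block would then be extracted with a low runlevel and a sensitive analog block with a high one, trading network size against accuracy.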



3.1 Model Extensions 

The design environment uses BiMOS technology, with bjt503 models for the bipolar
transistors and MOS9 models for the MOS transistors. The bjt503 model, also known
as the MEXTRAM model, incorporates a parasitic transistor formed by the base (P),
collector (N) and the substrate (P). This parasitic PNP transistor is modeled by a
junction capacitance between collector and substrate and a current source injecting
current into the substrate. A special transistor model with five substrate
terminals has been programmed and is used to simulate the device in RL 5 [15].



4 Interface 

PARCOURS can use the Dracula database and has now been extended to access the
Cadence DIVA database. Fig. 10 illustrates the flow to perform a substrate coupling
analysis on a design given in Cadence DIVA. PARCOURS reads the netlist extracted
from the Cadence Design Framework. Technology constraints and control statements
are defined in a Technology File. We use the Cadence Database Access (CDBA), an
interface that enables programs to access the internal DIVA database. The required
data is taken out of the database; its content depends on the runlevel. With the
collected input, PARCOURS generates the equivalent electrical network for the
substrate and connects it to the devices. The output is a netlist for the network
simulator Spectre. For runlevel 1 to runlevel 4 it is in Spice syntax. Spectre is
able to simulate regular Spice netlists; however, additional modules using the
hardware description language SpectreHDL can also be used. The extended models with
additional substrate nodes used in runlevel 5 are written in SpectreHDL. Simulating
with HDL models is more time-consuming than using the internal models written in
C, but for the experimental stage they are easier to handle.

The prototype of our substrate extractor works with Cadence's Design Framework. 
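To illustrate what a substrate network in Spice syntax looks like, the sketch below writes a list of mesh resistors and capacitors as standard two-terminal R and C element lines. It is not the PARCOURS output format: the function, the element naming scheme and the node names are hypothetical.

```python
def substrate_netlist(resistors, capacitors=()):
    """Build Spice element lines from (node_a, node_b, value) tuples."""
    lines = ["* substrate network (illustrative)"]
    for k, (a, b, r) in enumerate(resistors, start=1):
        lines.append(f"Rsub{k} {a} {b} {r:g}")      # resistance in ohms
    for k, (a, b, c) in enumerate(capacitors, start=1):
        lines.append(f"Csub{k} {a} {b} {c:g}")      # capacitance in farads
    lines.append(".end")
    return "\n".join(lines)

print(substrate_netlist(
    resistors=[("n1", "n2", 20000.0), ("n2", "n3", 15000.0)],
    capacitors=[("n1", "n2", 1e-16)],
))
```

In a complete flow, the substrate nodes would be tied to the bulk or substrate terminals of the extracted devices so that the simulator sees the devices and the substrate mesh as one circuit.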



5 Model Validation 

The substrate model is verified by comparing simulation results obtained with the
simulator Spectre to results of measurements.

Fig. 10. Database Access to Cadence Design Framework



5.1 Distance/Size Investigations 



The first set of investigations concerns substrate tie-downs. Fig. 11 shows a
physical cross-section. The applied technology uses a p-doped substrate with
σ = 1/(2 Ω cm) and an n-doped high-resistive epitaxial layer.



Fig. 11. Cross-Section for Distance/Size Investigations (labels: p+, p, epitaxial
layer (n), substrate (p))



Three rows of pairs of rectangular tie-downs have been designed, varying in size d1
from row to row and in distance d2 within a row. The resistance between the pairs
of tie-downs was measured with an RLC meter. The measured results are given in
Fig. 12 together with the simulation results of PARCOURS. They show good agreement
between the measured and simulated results.

Fig. 12. Comparison: Distance/Size Investigations



5.2 Transmitter/Detector Investigations 

Another set of measurements and simulations was performed in order to investigate 
the transmitter-receiver behavior of MOSFETs. The source of substrate noise is 
modeled as a chain of CMOS-inverters. The chain contains six inverters, each built 
with an NMOSFET (W/L = 36/1.2) and a PMOSFET (W/L = 90/1.2). 




Fig. 13. Simulated Results (left) and Measured Results (right) 



As output load a 3 pF capacitor is used. The chain is fed with alternating pulses 
produced by a signal generator. The sensor is an NMOSFET with a W/L ratio of 36/1.2. 
The sensor's source is grounded, the gate is biased at 2 V and its drain is connected to 
the supply voltage of 5 V, shunted by a resistor of 1 kΩ. Fig. 13 shows the simulated 
voltage at the drain node on the left side and the measured voltage on the right side. 
Once again, the correspondence between simulated and measured results is very good. 
Peaks reach about 35 mV. The voltage peaks at the substrate node of the sensor 
transistor are twice as high as those at the drain node. The distance in 
all transmitter-receiver constellations was greater than 150 µm. 



6 Experimental Results 

The investigation concerns a complex mixed-signal videochip fabricated in a 0.5 µm 
BiMOS-technology. It contains several digital blocks, such as clock generators, level 
converters and some other logic circuitry. The analog part contains a line-driver and 
two gap-buffers. This circuit was chosen for investigation because it contains a 
reference voltage source which is threatened by the digital noise. Layout verification 
without substrate crosstalk showed no critical interference. The voltage reference 
value was expected at 1.5 V. Measurements showed that the reference voltage is 
floating around the expected voltage. A substrate crosstalk analysis was performed to 
examine this phenomenon. Simulations with the extended netlist generated by 
PARCOURS show this interference in Fig. 14. The reference signal contains peaks up 
to 0.3 V. A layout verification with PARCOURS would have shown this problem 
before manufacturing the device. 




Fig. 14. Simulated Reference Voltage 



7 Conclusions 

We have presented a new modeling strategy for substrate crosstalk simulation. The 
approach uses the Voronoi tessellation method, which is known to lead to fewer circuit 
nodes than a uniform grid. Our algorithm uses a set of runlevels which extract 
substrate parasitics with rising accuracy. In the highest runlevel, structured models with 
multiple substrate terminals are applied; these are currently written in SpectreHDL. Owing 
to the runlevel concept, the tool is useful both for the rough extraction of large digital 
circuits and for the detailed extraction of analog or mixed analog/digital circuits. The 
accuracy of the linear parasitic substrate model and of some simulations has been 
verified by measurements. The approach works with the Cadence Design Framework 
databases Dracula and Diva. Therefore the prototype is applicable to industrial 
layouts. Investigations were carried out on a commercial videochip. 
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Abstract. This communication shows the influence of clocking schemes on the 
digital switching noise generation. It is shown how the choice of a suitable 
clocking scheme for the digital part reduces the switching noise, thus alleviating 
the performance limitations of mixed-signal analog/digital integrated circuits. 
Simulation data for a pipelined XOR chain using both single-phase and two-phase 
clocking schemes, as well as for two n-bit counters with different clocking styles, 
lead us to recommend multiple clock-phase and asynchronous styles for reducing 
switching noise. 



1 Introduction 

The integration of digital and analog circuits in mixed-signal integrated circuits has 
brought significant advantages to the implementation of advanced electronic systems. 
However, integrating large-scale digital and high-speed analog circuits in the same 
monolithic IC implies interactions between both parts, referred to as crosstalk, and analog 
signal degradation problems. In these mixed-signal circuits, the switching noise created by 
the digital circuits passes to the analog circuits, limiting their performance (resolution of 
A/D converters, jitter in PLLs, etc.) and making it very difficult to realize high-resolution 
analog circuits on the same substrate as complex digital circuitry. Such noise can be 
easily measured by monitoring the peak value of the dynamic current provided by the 
supply source (Fig. 1), which is proportional to the carrier injection [1]. 

The use of noise reduction techniques alleviates the influence of switching noise [2]: 
separating the digital and the analog parts as much as possible; using different supply 
and ground sources for analog and digital circuitry; taking substrate coupling into 
account and reducing it with substrate biasing and guard rings, etc. All these methods 
are related to layout and analog design, but do not include digital design methodology. 

Fig. 1. Dynamic (dominant) and short-circuit current in CMOS. 



Recently, some low-switching-noise digital CMOS families have been reported: 
CSL [3], FSCL [1] and CBL [4]. These current-mode structures work with an almost 
constant supply current, thus reducing variations in supply current and, hence, switching 
noise. However, static power consumption is the main penalty of such structures, making 
them unsuitable for low-power applications. The use of these current-mode families 
is recommended only in areas with a high risk of noise generation, while in other 
non-critical areas logic should be implemented with more conventional techniques. 
Moreover, the use of these current-mode logic families is complicated: the gates are 
complex and difficult to design and test, they need interfaces between current-mode and 
conventional CMOS logic, and they consume static power. Furthermore, additional 
reduction in switching noise implies higher static power consumption [5]. 

This communication explores additional ways of reducing switching noise from the 
digital domain by studying how the clocking style of the digital part influences the 
generation of switching noise when more conventional, low-cost CMOS digital 
implementations are used. 

This communication is organized as follows. Section 2 discusses the theoretical influence 
of the clocking scheme on the switching noise. Section 3 presents a comparison between 
a single-phase and a two-phase scheme as a case study. Section 4 presents a comparison 
between synchronous and asynchronous counters as a further case study. Section 5 
shows some simulation results. Finally, Section 6 presents the conclusions. 



2 Switching Noise and Timing Schemes 

The switching noise, also referred to as di/dt noise, increases when many circuits or blocks 
evaluate simultaneously, causing power supply fluctuations [6]. The choice of a specific 
clock strategy when designing the digital part of a mixed-signal IC has serious 
consequences for such noise generation. The timing scheme determines when gates 
switch, and the supply current is the sum of the contributions of the switching gates; 
therefore, as the number of synchronized switching gates increases, the peak supply 
current also increases. This is the case of Simultaneous Switching Noise (SSN) in buffer 
design [7] [8]. 
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The use of a single-phase clock scheme (fig. 2a) forces that most of the transitions 
in the system take place within a relatively small interval around (during and after) the 
active clock edge. Using two clock phases (fig. 2b), or a double-edge clock, reduces the 
number of gates or subcircuits that switch simultaneously, in combinational logic as well 
as in clock generator logic and flip-flops, and thus reduces the peak current of the 
supply source. Although the logic blocks can effectively switch at any time between 
consecutive active edges of the clocks considered (depending on the propagation delay 
of the combinational logic), the activity, i.e., the number of nodes that switch their logic 
value, will be statistically greater in the proximity of active clock edges (dashed area in the 
activity bars in fig. 2). If we consider that both implementations (fig. 2a and fig. 2b) are 
identical, in the sense that the same logic is used and the same nodes have the same 
capacitive load, and hence the same average current is consumed (see equation 
in fig. 1), the maximum current level occurs in the single-phase clock scheme 
(fig. 2a), since all the flip-flops and logic blocks switch (almost) simultaneously. Following 
this reasoning, the most suitable synchronous solutions for low-noise generation use 
more than one clock phase, although this introduces clock-skew problems that decrease 
the operation reliability. In such a case, a trade-off between low noise and reliability 
should be found. 
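
The argument that equal average current can come with very different peaks can be 
illustrated numerically. The following is only a minimal sketch under assumed 
conditions: each gate is taken to draw an identical triangular current pulse, and the 
gate count, pulse width and switching instants are hypothetical values that do not come 
from the simulations reported in this communication. 

```python
def triangular_pulse(t, start, width, height):
    """Supply current of one switching gate: a triangle that begins at `start`."""
    x = (t - start) / width
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return height * (1.0 - abs(2.0 * x - 1.0))


def supply_current(switch_times, period, steps=1000, width=1.0, height=1.0):
    """Sum the pulses of all gates at `steps` sample points of one clock period."""
    return [
        sum(triangular_pulse(period * k / steps, s, width, height) for s in switch_times)
        for k in range(steps)
    ]


PERIOD = 20.0  # ns, one period of a 50 MHz clock
GATES = 8      # assumed number of identical gates (hypothetical)

# Single-phase: every gate switches right after the same active edge.
single = supply_current([1.0] * GATES, PERIOD)
# Two-phase: half of the gates switch after each of two edges half a period apart.
half = GATES // 2
double = supply_current([1.0] * half + [1.0 + PERIOD / 2] * half, PERIOD)

for name, current in (("single-phase", single), ("two-phase", double)):
    print(name, "peak:", max(current), "average:", sum(current) / len(current))
```

In this idealized model the two-phase peak is half the single-phase peak while the 
averages coincide; real circuits add clock-generator activity and data-dependent 
switching that the sketch ignores. 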

Self-timed design [9] (fig. 2c) is an elegant, cost-effective means of controlling noise in 
a predictable manner. The global clock is replaced by locally generated clocks 
(clock1, clock2 and clock3) that indicate the validity of the data to be processed by the 
next logic block. The switching of gates is therefore unsynchronized, so that the supply 
currents of different self-clocked blocks do not overlap, which reduces the magnitude of 
the noise components. In this way, a self-timed circuit can be regarded as a system with 
a large number k of clock phases, whose operation is distributed over continuous time 
slots rather than discrete time instants. 



3 A Case Study: Comparison between Single-Phase 
and Two-Phase Clock Schemes 

In order to verify the reasoning of Section 2, we measure the switching noise in a simple 
system using two different clocking schemes. The system is an array of XOR gates 
pipelined at the gate level. The flip-flops used in the pipeline stages have been designed 
using a TSPC approach [10]. The reason for this choice is that the more conventional 
master-slave flip-flops work in an equivalent two-phase configuration, so the comparison 
would not be fair; we confirmed this, finding no appreciable difference with such 
flip-flops. TSPC flip-flops are also widely used in modern VLSI digital design. 

In fig. 3 we show both circuits at the transistor level. In the single-phase clock 
scheme, we can distinguish two kinds of TSPC elements: NMOS TSPC, operating on 
the rising edge of the clock, and PMOS TSPC, operating on the falling edge. In the 
two-phase clock scheme, we need only NMOS TSPC flip-flops. Owing to the use 
of both NMOS and PMOS TSPC, the output waveforms are the same in both cases, 
so the operation is identical without decreasing the clock frequency for the two-phase 
clocking scheme. 
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a) 
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b) 




[Figure labels: Clock1, Clock2, Clock3; Activity bar] 


Fig. 2. Different clocking styles for a pipelined logic structure: a) Single-phase, b) 
Two phases, c) Self-timed. The dashed areas in the Activity bar indicate the 
maximum switching density. 
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(b) 

Fig. 3. Transistor-level schematics of the array of XOR gates with a) a single-phase 
and b) a two-phase clock scheme. 



4 Another Case Study: Comparison between the Synchronous 
and the Asynchronous “Ripple” Counter 

Continuing the demonstration of the reasoning of Section 2, let us consider an n-bit 
counter as a generic example to support our claim that spikes in supply current decrease, 
comparing synchronous and self-timed clocking strategies. The event counter is a 
sequential machine of wide use and interest in most digital and mixed-signal applications, 
especially for frequency division. The counter counts events on the C signal, 
increasing or decreasing the count state. Two simple implementations of a 4-bit up 
counter are shown in fig. 4. Both modular implementations use T(oggle) flip-flops 
as elementary memory units. The synchronous implementation (fig. 4a) uses the C 
signal as the clock of all the flip-flops, while in the ripple implementation (fig. 4b) the 
clock signal of each flip-flop is the output of the previous flip-flop in the counter. 
Clearly, these are good examples of the different clocking strategies shown in the 
previous section. 

In fig. 5, an HSPICE simulation of a detailed state transition (from 1111 to 0000) is 
shown for both counters. It can easily be seen that the transitions in Q0, Q1, Q2 and Q3 
in the synchronous case are almost simultaneous, while in the asynchronous case the 
transition in Qi triggers the transition in Qi+1 after the propagation delay of the 
flip-flop. The average supply current is approximately the same, but it is more 
“concentrated” in the synchronous case, which means a higher maximum value and, 
hence, greater switching noise. 
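
The timing difference between the two counters during the 1111 to 0000 transition can 
be sketched with a purely behavioral model. This is an illustration only: the 
flip-flop delay is an assumed value, every flip-flop is assumed to have the same delay, 
and the function names are hypothetical rather than part of the HSPICE setup. 

```python
def toggle_times(bits, flip_flop_delay, ripple):
    """Time at which each output Q0..Q(bits-1) toggles during the transition
    from all ones to all zeros, measured from the active edge of C."""
    times = []
    for i in range(bits):
        if ripple:
            # Each flip-flop is clocked by the previous output, so delays add up.
            times.append((i + 1) * flip_flop_delay)
        else:
            # Every flip-flop is clocked by C, so all outputs toggle together.
            times.append(flip_flop_delay)
    return times


def max_simultaneous(times):
    """Largest number of outputs that toggle at the same instant."""
    return max(times.count(t) for t in times)


DELAY_NS = 0.5  # assumed flip-flop propagation delay, not taken from the paper

synchronous = toggle_times(4, DELAY_NS, ripple=False)
asynchronous = toggle_times(4, DELAY_NS, ripple=True)
print("synchronous: ", synchronous, "->", max_simultaneous(synchronous), "at once")
print("asynchronous:", asynchronous, "->", max_simultaneous(asynchronous), "at once")
```

In the model, all four outputs of the synchronous counter toggle at the same instant, 
whereas the ripple counter never toggles more than one output at a time. 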








Fig. 4. 4-bit counter: a) synchronous, b) asynchronous “ripple”. 



5 Design and Simulation Results [11] 

Simulations have been performed on a 0.7 µm standard technology. The results 
corresponding to the comparison between synchronous clock schemes are shown in table 1, 
while the results for the counters are shown in table 2. 



Table 1. Simulation results of the synchronous clocking schemes for the pipelined 
XOR array. F=50 MHz. 

             Transistors   Power (mW)    Iaverage (µA)   Ipeak (µA) 
                           Vdd=5v/3.3v   Vdd=5v/3.3v     Vdd=5v/3.3v 
One-Phase    31            0.36 / 0.11   68.3 / 34.7     4200 / 1720 
Two-Phase    31            0.36 / 0.08   68.8 / 26.4     2250 / 850 
















AJ. Acosta et al. 



[Figure: waveforms of C, Q0, Q1, Q2, Q3 and iVDD in panels a) and b)] 

Fig. 5. Detailed transition from count state 1111 to 0000 in a) synchronous, b) 
asynchronous 4-bit counter. 

Table 2. Simulation results for counters. F=50 MHz. PDP: Power-Delay-Product. 

               Transistors   PDP (pJ)   Iaverage (µA)   Ipeak (µA) 
                             Vdd=5v     Vdd=5v/3.3v     Vdd=5v/3.3v 
4-bit synch.   116           0.17       170 / 100       4552 / 2410 
4-bit asynch.  104           0.51       130 / 70        1274 / 666 
8-bit synch.   244           0.24       221 / 125       9033 / 4809 
8-bit asynch.  208           1.11       184 / 81.3      1421 / 708 

In the case of average power consumption, we can see that there is almost no 
difference between both synchronous clocking schemes, the value for the one-phase 
scheme being approximately 105% of that for the two-phase one. 
In the case of counters, differences between synchronous and asynchronous are below 
10%. These results can be seen in fig. 6. 




Fig. 6. Average supply current vs. supply voltage for a) the one-phase and two-phase clocking 
scheme and b) the 4-bit counter. 



Concerning the supply current peak, we can see that the peak for the single-phase 
scheme is roughly twice that for the two-phase one, meaning that 
the two-phase scheme presents better switching-noise behavior. In the case of 
counters, the peak supply current is much higher for the synchronous 
case (up to 4 times, depending on the Vdd value). These results can be seen in fig. 7. 
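
The peak ratios quoted here can be recomputed from the tabulated values. The short 
sketch below copies the Ipeak columns of tables 1 and 2; the dictionary layout and the 
function name are illustrative choices only, and the ratios do not depend on the 
current unit. 

```python
# (Ipeak at Vdd = 5 V, Ipeak at Vdd = 3.3 V), copied from tables 1 and 2.
PEAK = {
    "one-phase": (4200, 1720),
    "two-phase": (2250, 850),
    "4-bit synch.": (4552, 2410),
    "4-bit asynch.": (1274, 666),
}
PAIRS = (("one-phase", "two-phase"), ("4-bit synch.", "4-bit asynch."))


def peak_ratio(noisy, quiet):
    """Ratio of the peak supply currents of two designs at 5 V and at 3.3 V."""
    return tuple(a / b for a, b in zip(PEAK[noisy], PEAK[quiet]))


ratios = {pair: peak_ratio(*pair) for pair in PAIRS}
for (noisy, quiet), (at_5v, at_3v3) in ratios.items():
    print(f"{noisy} / {quiet}: {at_5v:.2f} at 5 V, {at_3v3:.2f} at 3.3 V")
```

With the tabulated data the single-phase peak is about 1.9 to 2.0 times the two-phase 
peak, and the synchronous 4-bit counter peak is about 3.6 times the asynchronous one. 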





Fig. 7. Peak of supply current vs. supply voltage for (a) the one-phase and two-phase clocking 
scheme and (b) a 4-bit counter. 



The dependence of the supply current peaks on the clocking scheme is clearly seen 
in fig. 8, where timing waveforms and spectra of the supply current are 
depicted. They show that the peak values in time are greater for the synchronous case, 
and that the harmonics located at multiples of the fundamental clock frequency (50 
MHz) are considerably higher (by 4 to 11 dB). 





Fig. 8. Timing waveforms and spectra of supply current for a) synchronous clocking schemes 
and b) the 4-bit counter, Vdd = 5v, f = 50MHz. 

As counters are useful circuits, we have measured the power-delay product as an 
additional parameter in this demonstrator. We have also performed a comparison as a 
function of the number of stages, which is equivalent to finding the influence of the 
transistor count. These results are summarized as follows: 

- The power-delay product of the counters (fig. 9) is better for the synchronous 
approach, meaning that better performance can be achieved, but at the cost of 
extra hardware: one two-input NAND gate per bit. 

- The maximum supply current (fig. 10) increases linearly with the counter length 
for the synchronous approach, while the value for the asynchronous one is almost 
constant. As the number of stages increases, more flip-flops switch simultaneously, 
increasing the switching noise. 
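
The scaling with counter length can be checked against the two lengths listed in 
table 2. The sketch below uses only the Ipeak values at Vdd = 5 V; with two data points 
per style it is a consistency check rather than a fit, and the helper name is 
hypothetical. 

```python
# Ipeak at Vdd = 5 V, copied from table 2, keyed by (style, number of bits).
PEAK_5V = {
    ("synch.", 4): 4552, ("synch.", 8): 9033,
    ("asynch.", 4): 1274, ("asynch.", 8): 1421,
}


def growth(style):
    """Factor by which the peak grows when the counter goes from 4 to 8 bits."""
    return PEAK_5V[(style, 8)] / PEAK_5V[(style, 4)]


for style in ("synch.", "asynch."):
    print(f"{style}: peak grows by a factor of {growth(style):.2f} for twice the bits")
```

Doubling the length roughly doubles the synchronous peak, while the asynchronous peak 
grows by little more than a tenth. 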



6 Conclusions 

This communication has shown the influence of the clocking strategy on switching-noise 
generation. It has been shown how the choice of a suitable clocking scheme for the 
digital part alleviates the problems associated with switching noise in mixed-signal 
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Fig. 9. Power-delay product vs counter length. 



Asynchronous vs synchronous counter 




Fig. 10. Maximum supply current vs counter length. 



Analog/Digital Integrated Circuits, where better timing and power performance does not 
necessarily imply greater suitability for mixed A/D design. 

We have analyzed and simulated switching noise generation by comparing the 
peak current results for two different synchronous clocking schemes (one- and 
two-phase clocking). We have also compared the results obtained for a synchronous and 
an asynchronous version of a common n-bit counter. Simulation data for the different 
clocking styles have led us to two statements: a) additional reduction of switching noise 
when using conventional digital CMOS circuits can be achieved by selecting the clock 
scheme suitably; b) the use of multiple clock-phase and asynchronous styles is strongly 
recommended. Although these solutions can introduce some problems of reliability 
(clock skew) or complexity (more hardware), these are problems of minor concern in 
mixed-signal design compared with switching noise effects. 
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Abstract. In this paper we present an application of nonlinear sym- 
bolic simplification techniques to analog circuits using Analog Insydes. 
The goal is to get insights into the circuits’ behavior and to generate 
efficient behavioral models. After describing the different simplification 
techniques and the ranking methods we explain how to generate a pin- 
compatible macro model. In an example, the algorithm is applied to a 
nonlinear square root function block. 



1 Introduction 

The behavior of a nonlinear analog circuit can be described by a set of nonlinear 
differential and algebraic equations (DAE system) in symbolic form. This system 
is usually far too complex to be human-interpretable and understandable. To 
get an interpretable symbolic expression describing the circuits’ behavior and 
parameter dependencies it is thus necessary to apply symbolic simplification 
methods to the DAE system. Additionally, the simplification routines can be 
used to generate a macro model which can be simulated more efficiently than the 
original system. In contrast to simplifications by hand, the proposed algorithm 
provides error control, i.e., the deviation of the observed input/output behavior 
is guaranteed not to exceed a user-given error bound. 

The first version of the algorithm was presented in [3]. Several extensions 
of this algorithm have been developed since then, for example towards multi- 
input/multi-output systems, new simplification methods, or new analysis meth- 
ods. We refer to [9,8,10,13,12] for a description of the enhancements. 

At ITWM the algorithm is being implemented as part of Analog Insydes [7], 
a Mathematica [14] add-on toolbox for symbolic analysis and approximation of 
analog circuits. 

2 Simplification Techniques 

To obtain a simplified system of equations, several simplifications are applied to 
the system. A simplification can either be an algebraic manipulation or a modifi- 
cation of the equations which results in a new, approximative system. The latter 
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requires a numeric simulation to determine the error caused by the modification. 
Algebraic manipulations are exact operations, thus no error tracking is needed 
here. 

The first group of simplification techniques stems from the observation that 
some variables of the DAE system do not influence the input-output behavior. 
It includes the elimination of variables and equations and the deletion of 
variables’ time derivatives. In the terminology above, the first of these is an algebraic 
manipulation. 

For the second group of simplification techniques, all equations of the DAE 
system are expanded to sum-of-products form, where each part of this expanded 
sum is called a term. The observation that some terms of a summation contribute 
only a very small part to the whole sum, and thus can be simplified or even neglected, 
motivates the following modifications of terms: deletion of terms, substitution 
of terms by constant numeric values, and linearization of terms. 

Each modification step is followed by a numerical error calculation to mea- 
sure the real influence on the input-output behavior due to the modification. 
To calculate the error, a numerical simulation of the system is performed. It 
depends on the given problem which simulation method has to be adopted (DC, 
AC, transient, etc.) - it is even possible to combine different simulations. This 
numerical calculation yields a set of numerical values for all output variables (for 
example, DC points) which have to be combined through an appropriate norm 
to a single error value. Which norm to use (relative norm, maximum norm, 
etc.) again depends on the current problem. In Sect. 5, for example, we apply a 
multiple DC analysis combined with a maximum norm as error calculation. 

Since we are working on multi-output systems, the deviation of each output 
variable has to be taken into account. If the error on one of the output variables 
exceeds the given error bound for this variable, the modification is undone. 
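
The modify, simulate, compare and undo cycle described in this section can be sketched 
on a toy problem. The code below is not the Analog Insydes implementation: it assumes a 
single output given as an explicit sum of terms, uses plain function evaluation on a 
grid in place of a circuit simulation, and all names, the example expression and the 
error bound are hypothetical. 

```python
import math


def simulate(terms, grid):
    """Stand-in for the numerical simulation: evaluate the sum of terms on the grid."""
    return [sum(f(x) for f in terms.values()) for x in grid]


def max_norm_error(reference, approximation):
    """Combine the deviations at all grid points into a single error value."""
    return max(abs(r - a) for r, a in zip(reference, approximation))


def cancel_terms(terms, grid, error_bound):
    """Try to delete each term; a deletion is kept only if the error against the
    original system stays within the bound, otherwise it is undone."""
    reference = simulate(terms, grid)
    kept = dict(terms)
    for name in list(terms):
        trial = {k: f for k, f in kept.items() if k != name}
        if trial and max_norm_error(reference, simulate(trial, grid)) <= error_bound:
            kept = trial  # modification accepted
        # else: modification undone, `kept` is left unchanged
    return kept


# Hypothetical expression y(x) = x^2 + 0.001*x + 1e-4*sin(x) on a hypothetical grid.
terms = {
    "x^2": lambda x: x * x,
    "0.001*x": lambda x: 0.001 * x,
    "1e-4*sin(x)": lambda x: 1e-4 * math.sin(x),
}
grid = [0.1 * k for k in range(11)]
simplified = cancel_terms(terms, grid, error_bound=2e-3)
print(sorted(simplified))
```

The error is always measured against the original system, so accepted deletions 
accumulate against one fixed bound rather than each being judged in isolation. 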



3 Ranking Methods 

The order in which the methods described in Sect. 2 are applied influences the number 
of simplifications possible before the error bound is reached, and an optimal order 
depends on the given problem. The implementation of the algorithm in Analog 
Insydes allows this order to be changed. 

Within one simplification method the number of possible simplifications is 
also influenced by the order in which the simplifications are applied, for example, 
the order of terms in which they are deleted. An optimized order is desirable 
to maximize the number of simplifications and to minimize the number of error 
calculations. An algorithm that predicts the influence of a simplification on the 
output is called a ranking method. Ranking algorithms are described in [9] for 
cancellation of terms and in [13] for substitution of terms by constant numeric 
values. 

To handle multi-output systems the ranking algorithm must be able to predict 
the error for each output variable separately. Afterwards these values have 
to be combined with the user-given error bounds into an overall error prediction. 
For this, assume that $\varepsilon_1, \dots, \varepsilon_n$ are the given error bounds 
for the output variables $v_1, \dots, v_n$ and $\Delta_1, \dots, \Delta_n$ are the 
predicted influences of the modification on the output variables. Then one way to 
compute the overall error prediction is given by 

$$x = \frac{1}{n} \sum_{i=1}^{n} \frac{\Delta_i}{\varepsilon_i} \qquad (1)$$


This is done for each part of the DAE system, giving a list of error predictions. 
Then the parts of the DAE system are processed in order of increasing 
error prediction. What is meant by part in this context depends on the 
simplification method: for cancellation of terms, for example, part denotes a single 
summand of the equations; for removal of derivative terms, part denotes a 
summand involving derivatives. Note that for multi-output systems the ranking 
order depends on the given error bounds (see Eq. (1)). 
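
The dependence of the ranking order on the error bounds can be illustrated with a small 
sketch. It assumes that the overall prediction is the mean of the predicted influences, 
each divided by the error bound of its output variable; the term names, predicted 
influences and bounds below are hypothetical. 

```python
def overall_prediction(influences, bounds):
    """Mean of the predicted influences, each normalised by its error bound."""
    if len(influences) != len(bounds):
        raise ValueError("one error bound per output variable is required")
    return sum(d / e for d, e in zip(influences, bounds)) / len(influences)


def rank_parts(predicted, bounds):
    """Sort the parts of the system by increasing overall error prediction."""
    return sorted(predicted, key=lambda part: overall_prediction(predicted[part], bounds))


# Hypothetical predicted influences of three terms on two output variables.
predicted = {
    "term_a": (5e-6, 1e-3),
    "term_b": (1e-6, 8e-3),
    "term_c": (40e-6, 2e-3),
}
loose_bounds = (50e-6, 10e-3)
tight_second_bound = (50e-6, 1e-3)

print(rank_parts(predicted, loose_bounds))
print(rank_parts(predicted, tight_second_bound))
```

Tightening the bound on the second output moves term_b, which mainly disturbs that 
output, from the middle of the list to the end. 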



4 Model Generation 

Generating a nonlinear model is quite different from the classical 2-port analysis 
technique as described for example in [5]: The parameters of a linear 2-port are 
determined by stimulating one port with an independent source while setting 
the current or voltage of the other port to zero by using a short or open 
circuit. Afterwards the complete 2-port description is set up by superimposing the 
results of four of these measurements. 

For numerical simulations this technique is suitable, but it fails for nonlinear 
model generation, for which superposition does not hold. Therefore we have to 
determine the complete 2-port description at once, which can be done for linear 
and nonlinear n-ports using symbolic analysis: 

For each port choose a voltage or a current as input and the other one as output. 
For linear n-ports the input and output values are determined by the kind of 
n-port description, e.g., hybrid parameters; for nonlinear n-ports the output and 
input values are given by the functional behavior of the circuit. Afterwards stimulate all 
inputs with corresponding independent sources, then set up the circuit equations 
and eliminate all variables which are not needed to describe the output values. 
The elimination of variables is always possible for a linear circuit. Therefore the 
n-port parameters can be extracted directly from the resulting system of equations. 
For nonlinear circuits it is in most cases impossible to eliminate needless variables or 
to solve for the output values explicitly. Therefore, in general a nonlinear 
model will be an implicit system of equations. 
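
For the linear case, the elimination step can be sketched with exact rational 
arithmetic. The example is hypothetical and not taken from the paper: a resistive 
T-network with both ports stimulated at once, whose nodal equations are reduced by 
eliminating the internal node voltage. It works with numeric element values rather 
than symbols, so it only illustrates the structure of the procedure. 

```python
from fractions import Fraction


def eliminate_internal(y, internal):
    """Eliminate one internal node voltage from the nodal equations Y*v = i
    (no source at that node) and return the reduced port admittance matrix."""
    ports = [k for k in range(len(y)) if k != internal]
    pivot = y[internal][internal]
    return [
        [y[r][c] - y[r][internal] * y[internal][c] / pivot for c in ports]
        for r in ports
    ]


# Hypothetical T-network: R1 and R2 connect the ports to node m, R3 connects m to ground.
r1, r2, r3 = Fraction(1), Fraction(2), Fraction(3)
g1, g2, g3 = 1 / r1, 1 / r2, 1 / r3

# Nodal equations with both ports stimulated; unknowns ordered (V1, V2, Vm).
y_full = [
    [g1, 0, -g1],
    [0, g2, -g2],
    [-g1, -g2, g1 + g2 + g3],
]
y_ports = eliminate_internal(y_full, internal=2)
print(y_ports)
```

The reduced matrix relates the port currents to the port voltages directly, which is 
the complete 2-port description obtained in one step instead of four separate 
measurements. 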

To use the model in a numerical simulator, this system of nonlinear equations 
has to be converted into a simulator specific model description. In addition, an 
electrical interface has to be provided to gain access to the input and output 
values without disturbing the system of equations. 
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5 Example 

As an example the algorithm is applied to a bipolar square root function circuit 
shown in Fig. 1 [6]. In this example we consider the DC input/output behavior of 




Fig. 1. Schematics of square root function block. 

the circuit, i.e., we treat it as a static system. Thus the underlying DAE system 
here degenerates to a nonlinear equation system without any differential 
equations. The output current Iout is proportional to the square root of the input 
current Iin. The task is to generate a simplified symbolic formula describing this 
functional dependency and afterwards to create a parametric behavioral model 
as a two-port description of the circuit. 




Fig. 2. Square root function block with stimulus. 

This problem will be solved according to the symbolic analysis work flow 
described in [11]. As stated in Sect. 4 we choose Iin as an input value and 
Iout as an output value. Therefore, we apply a current source II and a voltage 
source VLOAD as shown in Fig. 2. The value of II is swept from 20 µA to 1 mA, 
the value of VLOAD is varied from 0 V to 3.5 V. We measure the node voltage V$5 
at node 5 and the current I$VLOAD through the voltage source VLOAD. Figure 3 
shows the result of the simulation of the circuit within Saber [2]. The arrow 
denotes the sweeping of VLOAD. As can be seen, however, the plots for different 
values of VLOAD are identical: the value of VLOAD evidently has no influence on 
the observed output values. 
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Fig. 3. Saber simulation result of I$VL0AD and V$5 (Saber notation i (v_dc . vload) 
and 5). 



After finishing the numerical reference simulation within Saber, all succeed- 
ing steps including numerical simulations are now performed using Analog Insy- 
des. For this, the Saber netlist is automatically imported into Analog Insydes. 
Additionally, the Saber simulation data is read in as reference for further com- 
parisons. 

Analog Insydes has the ability to switch between different transistor models. 
Applying both the Gummel-Poon and the Ebers-Moll bipolar transistor model 
gives no visible difference to the Saber reference simulation (Figure 4 shows the 
simulation using the Ebers-Moll model). Thus we choose the Ebers-Moll model 
which is much simpler than the Gummel-Poon model - the resulting DAE system 
as shown in Fig. 5 consists of 19 equations with 69 terms instead of 43 equations 
with 143 terms. 



[Plots of I$VLOAD and V$5; horizontal axis from 0.0002 to 0.001] 



Fig. 4. Saber reference (dashed) and Analog Insydes simulation (solid) of I$VLOAD 
and V$5. 



Once the DAE system is set up, the nonlinear simplification algorithm can be 
applied. The error of the simplified DAE system will be computed for I$VLOAD 
and V$5 on a discrete grid for II and VLOAD, where the above given sweep 
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IB - I$BC$Q2 + I$VCC + I$VLOAD == 0, 

-IB - I$BC$Q1 + I$BC$Q2 + I$BC$Q3 + I$BE$Q2 + I$BE$Q3 + I$BS$Q2 + I$BS$Q3 
-I$BE$Q3 + I$BE$Q4 + I$BS$Q4 == 0, 

II + I$BC$Q1 + I$BE$Q1 - I$BE$Q2 + I$BS$Q1 == 0, -I$BC$Q3 - I$VLOAD == 0, 

[Remaining equations of the listing: Ebers-Moll branch currents I$BC, I$BE and I$BS of 
Q1 to Q4, with parameters IS, BF and BR, and the source equations for VCC and VLOAD] 


Fig. 5. DAE system of the square root function block. 



intervals are uniformly divided into 6 steps. The maximum error is set to an 
absolute deviation of 50 µA for I$VLOAD and 10 mV for V$5. 

At first the DAE system is simplified algebraically by eliminating variables. 
This reduces the number of equations to 4 with a total number of 40 terms. 
Note that this is a mathematically exact reduction, so no error calculation has to 
be done here. In the next step cancellation of terms is applied as described in 
Sect. 2 up to the error bound given above. Of course, this does not change the 
number of equations, but reduces the total number of terms to 11. Further 
algebraic elimination finally ends up in a DAE system consisting of 4 equations 
with 8 terms (Fig. 6). Note that, as mentioned above, the output of the original 



[Listing: four equations in V$3, V$4, V$5 and I$VLOAD with inputs II and IB and 
parameters IS$Q1 to IS$Q4] 

Fig. 6. Simplified DAE system. 



system does not depend on VLOAD, so the algorithm automatically removes any 
occurrences of VLOAD from the original DAE system. 

The equation system shown in Fig. 6 is an implicit equation system in the 
output variables V$5 and I$VLOAD and the internal variables V$3 and V$4. 
Fortunately, in this example it is possible to eliminate the internal variables and 
to solve the remaining equations explicitly for the output variables. This can be 
achieved using standard Mathematica functions. As a result (Fig. 7), two explicit 
symbolic equations are obtained which depend on the input value II, the 
parameters IB and VT, and the saturation current parameters IS$Q1, ..., IS$Q4 
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Fig. 7 . Explicit solution of the output variables. 



for each transistor. This is exactly the formula stated in [6]. But note, that this 
result was obtained automatically under full error control. Since it is already 
simple enough, no further simplification steps will be applied. The two symbolic 
equations shown in Fig. 7 describe the desired input/output behavior of the cir- 
cuit. Figure 8 shows the comparison of the output of the Saber simulation and 
the simplified system. It can be seen that the error bound is fulfilled. 
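
The elimination step can be checked numerically with a short sketch. It
assumes that the simplified system has the usual translinear form, i.e. four
exponential relations with base-emitter voltages V$5, V$3 - V$5, V$3 - V$4 and
V$4 for Q1 to Q4. This form, the resulting closed-form expressions and all
parameter values below (IB, VT, the saturation currents) are assumptions made
for illustration and are not taken from Fig. 7.

```python
import math

def explicit_outputs(II, IB, VT, IS1, IS2, IS3, IS4):
    """Closed-form outputs under the assumed translinear system."""
    i_vload = math.sqrt(IS3 * IS4 * IB * II / (IS1 * IS2))
    v5 = VT * math.log(IB / IS1)
    return i_vload, v5

def residuals(II, IB, VT, IS1, IS2, IS3, IS4):
    """Insert the explicit solution into the four assumed implicit equations."""
    i_vload, v5 = explicit_outputs(II, IB, VT, IS1, IS2, IS3, IS4)
    # Internal variables recovered from the same assumed equations.
    v3 = VT * math.log(IB * II / (IS1 * IS2))
    v4 = VT * math.log(i_vload / IS4)
    return (
        -IB + math.exp(v5 / VT) * IS1,
        -math.exp((v3 - v4) / VT) * IS3 + math.exp(v4 / VT) * IS4,
        II - math.exp((v3 - v5) / VT) * IS2,
        math.exp((v3 - v4) / VT) * IS3 - i_vload,
    )

# Hypothetical parameter values, not taken from the paper.
params = dict(IB=1e-4, VT=0.02585, IS1=1e-16, IS2=1e-16, IS3=1e-16, IS4=1e-16)

# Sweep the input current and report outputs and the worst residual.
for II in (2e-4, 4e-4, 6e-4, 8e-4, 1e-3):
    i_vload, v5 = explicit_outputs(II, **params)
    worst = max(abs(r) for r in residuals(II, **params))
    print(f"II={II:.1e}  I$VLOAD={i_vload:.4e}  V$5={v5:.4f}  residual={worst:.1e}")
```

With identical saturation currents, the output current reduces to the square
root of IB times II, which is the behavior expected from a square root
function block.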



Fig. 8. Saber reference (dashed) and simulation of simplified DAE system (solid) of
I$VLOAD and V$5.




Fig. 9. Behavioral model, replacing the square root function block, with stimulus. 



The last step is to generate a macro model using the simplified set of equa- 
tions (Fig. 7). We choose the branch between node 5 and ground as the input 
port and the branch between node out and ground as the output port (see Fig. 9). 

The Analog Insydes command WriteModel is used to translate the system 
into a Saber MAST [1] template. Afterwards this template is used as a re- 
placement for the square root function block. The numerical simulation result 
computed by Saber can be seen in Fig. 10. 
Fig. 10. Saber simulation result of I$VLOAD and V$5 (Saber notation i(v_dc.vload)
and 5) using the behavioral model.



Although we used the Saber simulator throughout this example, the appli- 
cation of the algorithm is of course independent of a specific circuit simulator. 

6 Conclusions 

The presented approach extends the simplification techniques of Analog Insydes
to multi-input/multi-output systems. Starting with a netlist on transistor
level, it is now possible to generate behavioral models automatically in a
simulator-independent way. In an example we showed the application of Analog
Insydes to a nonlinear square root function block. It was possible to derive a
human-interpretable parameterized symbolic formula, which - in contrast to
calculations by hand - assures a user-given error bound. Furthermore, we
automatically generated a Saber MAST template of the simplified formula, which
can be used as a pin-compatible behavioral model replacement for the square
root function block.